ModelServe huggingfaceserve Serve

The serve action deploys HuggingFace ML models as services on Kubernetes. A Task is created by calling run() on the Function; task parameters are passed through that call.

Overview

The huggingfaceserve function kind supports deploying HuggingFace models as REST API services. It supports various model formats and tasks including text generation, classification, and embedding.

Quick example

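import digitalhub as dh

function = dh.new_function(
    project="my-project",  # hypothetical name; required when creating from the library
    name="my-huggingface-service",
    kind="huggingfaceserve",
    path="s3://my-bucket/path-to-model/",  # trailing "/" so the path matches the Model Path pattern
)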

run = function.run(
    action="serve",
    replicas=1,
    huggingface_task="text_generation"
)

Parameters

Function Parameters

Must be specified when creating the function.

| Name | Type | Description |
| --- | --- | --- |
| project | str | Project name. Required only when creating from the library; otherwise MUST NOT be set. |
| name | str | Name that identifies the object. Required. |
| kind | str | Function kind. Must be huggingfaceserve. Required. |
| uuid | str | Object ID in UUID4 format. |
| description | str | Description of the object. |
| labels | list[str] | List of labels. |
| embedded | bool | Whether the object should be embedded in the project. |
| path | str | Path to the model files. Required. |
| model_name | str | Name of the model. |
| image | str | Docker image used to serve the model. |

Model Path

The model path must follow the pattern:

path_regex = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    + r"|"
    + r".*\/$"
    + r"|"
    + r".*\.zip$"
    + r"|"
    + r"^huggingface?://.*$"
    + r"|"
    + r"^hf?://.*$"
)
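To see which paths the pattern accepts, it can be exercised with Python's `re` module. The following is a minimal sketch, not the library's own validation code: it assumes the slash and dot are escaped once (`\/`, `\.`) and that matching starts at the beginning of the string. The helper name `is_valid_model_path` and all sample paths (bucket, project, and model names) are made up for illustration.

```python
import re

# Path pattern from this section, assuming "/" and "." are escaped once.
PATH_REGEX = (
    r"^(store://([^/]+)/model/huggingface/.*)"
    r"|.*\/$"
    r"|.*\.zip$"
    r"|^huggingface?://.*$"
    r"|^hf?://.*$"
)


def is_valid_model_path(path: str) -> bool:
    """Return True if the path matches one of the accepted forms."""
    return re.match(PATH_REGEX, path) is not None


# Hypothetical sample paths, one per accepted form plus one rejected path.
candidates = [
    "store://my-project/model/huggingface/my-model",
    "s3://my-bucket/path-to-model/",
    "s3://my-bucket/model.zip",
    "huggingface://my-org/my-model",
    "hf://my-org/my-model",
    "s3://my-bucket/path-to-model",  # no trailing "/" and not a .zip
]
for candidate in candidates:
    print(candidate, is_valid_model_path(candidate))
```

Under these assumptions, a plain directory path is accepted only when it ends with a slash, while archives must end in `.zip`.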

Model Image

Model image must follow the pattern:

image_regex = r"^kserve\/huggingfaceserver?:"
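The image pattern can be checked the same way. This is a small sketch under the assumption that the slash is escaped once and that the pattern is applied from the start of the image reference. The helper name `is_valid_image` and the tag `latest` are illustrative only.

```python
import re

# Image pattern from this section, assuming "/" is escaped once.
IMAGE_REGEX = r"^kserve\/huggingfaceserver?:"


def is_valid_image(image: str) -> bool:
    """Return True if the image reference starts with the expected repository and a tag separator."""
    return re.match(IMAGE_REGEX, image) is not None


print(is_valid_image("kserve/huggingfaceserver:latest"))            # tagged image from the kserve repository
print(is_valid_image("docker.io/kserve/huggingfaceserver:latest"))  # registry prefix is not allowed by the pattern
print(is_valid_image("kserve/huggingfaceserver"))                   # missing ":" tag separator
```

Note that the pattern requires the colon after the image name, so an untagged reference does not match.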

Task Parameters

Can only be specified when calling function.run().

| Name | Type | Description |
| --- | --- | --- |
| action | str | Task action. Required. Must be serve. |
| node_selector | list[dict] | Node selector. |
| volumes | list[dict] | List of volumes. |
| resources | dict | Resource limits/requests. |
| affinity | dict | Affinity configuration. |
| tolerations | list[dict] | Tolerations. |
| envs | list[dict] | Environment variables. |
| secrets | list[str] | List of secret names. |
| profile | str | Profile template. |
| replicas | int | Number of replicas. |
| service_type | str | Service type. |
| service_name | str | Service name. |
| huggingface_task | str | HuggingFace task type. |
| backend | str | Backend type. |
| tokenizer_revision | str | Tokenizer revision. |
| max_length | int | Maximum sequence length for the HuggingFace tokenizer. |
| disable_lower_case | bool | Do not use lower case for the tokenizer. |
| disable_special_tokens | bool | The sequences will not be encoded with the special tokens relative to their model. |
| dtype | str | Data type to load the weights in. |
| trust_remote_code | bool | Allow loading of models and tokenizers with custom code. |
| tensor_input_names | list[str] | The tensor input names passed to the model. |
| return_token_type_ids | bool | Return token type ids. |
| return_probabilities | bool | Return all probabilities. |
| disable_log_requests | bool | Disable request logging. |
| max_log_len | int | Max number of prompt characters or prompt ID numbers printed in the log. |

HuggingFace Task

You can specify the task type for the HuggingFace model. The task type must be one of the following:

  • sequence_classification
  • token_classification
  • fill_mask
  • text_generation
  • text2text_generation
  • text_embedding

Backend

You can specify the backend type for the HuggingFace model. The backend type must be one of the following:

  • AUTO
  • VLLM
  • HUGGINGFACE

Dtype

You can specify the data type to load the weights in. The data type must be one of the following:

  • AUTO
  • FLOAT32
  • FLOAT16
  • BFLOAT16
  • FLOAT
  • HALF
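Since `huggingface_task`, `backend`, and `dtype` each accept only the values listed above, it can help to check them before calling `function.run()`. The sketch below is one possible way to do that. The helper `build_serve_params` is hypothetical and not part of the library; it only assembles a keyword-argument dictionary from the documented values.

```python
# Allowed values, copied from the lists in this section.
HUGGINGFACE_TASKS = {
    "sequence_classification",
    "token_classification",
    "fill_mask",
    "text_generation",
    "text2text_generation",
    "text_embedding",
}
BACKENDS = {"AUTO", "VLLM", "HUGGINGFACE"}
DTYPES = {"AUTO", "FLOAT32", "FLOAT16", "BFLOAT16", "FLOAT", "HALF"}


def build_serve_params(huggingface_task, backend=None, dtype=None, **extra):
    """Validate the enumerated task parameters and return kwargs for function.run()."""
    checks = [
        ("huggingface_task", huggingface_task, HUGGINGFACE_TASKS),
        ("backend", backend, BACKENDS),
        ("dtype", dtype, DTYPES),
    ]
    params = {"action": "serve"}
    for name, value, allowed in checks:
        if value is None:
            continue  # optional parameter left unset
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        params[name] = value
    params.update(extra)  # other task parameters, e.g. replicas
    return params


params = build_serve_params("text_generation", backend="VLLM", dtype="BFLOAT16", replicas=1)
print(params)
# With a function created as in the quick example:
# run = function.run(**params)
```

Failing early with a `ValueError` keeps a typo such as `text-generation` from reaching the cluster as a misconfigured task.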

Run Parameters

Can only be specified when calling function.run().

This action has no run-specific parameters.

Entity methods

Run methods

Once the run is created, you can access its attributes and methods through the run object.

invoke

Invoke the served model. By default, the request targets the infer v2 endpoint.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Name of the model. | None |
| method | str | Method of the request. | 'POST' |
| url | str | URL of the request. | None |
| **kwargs | dict | Keyword arguments to pass to the request. | {} |

Returns:

| Type | Description |
| --- | --- |
| Response | Response from the request. |
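As an illustration of a call to `invoke`, the sketch below builds a request body in the shape used by the v2 inference protocol (a list of named input tensors). Several details are assumptions rather than documented behaviour: the helper `build_v2_infer_request` is hypothetical, the input name `input-0` and the model name `my-model` are placeholders, and passing the body as `json=` assumes the keyword arguments are forwarded to a requests-style HTTP call.

```python
import json


def build_v2_infer_request(texts, input_name="input-0"):
    """Build a v2 inference request body carrying a batch of text inputs."""
    texts = list(texts)
    return {
        "inputs": [
            {
                "name": input_name,     # assumed input tensor name
                "shape": [len(texts)],  # one element per input text
                "datatype": "BYTES",    # v2 datatype used for strings
                "data": texts,
            }
        ]
    }


payload = build_v2_infer_request(["The capital of France is [MASK]."])
print(json.dumps(payload, indent=2))

# With a run created as in the quick example (assumed usage):
# response = run.invoke(model_name="my-model", json=payload)
# print(response.json())
```

If the service is deployed with a different endpoint, the `url` parameter can be set explicitly instead of relying on the default.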