Managing LLM Models with KubeAI Runtime
To support LLM scenarios within the platform, it is possible to use the KubeAI runtime when the KubeAI operator is enabled.
Currently, the KubeAI runtime allows for deploying models for the following LLM tasks:

`TextGeneration`
: text generation tasks with the OpenAI-compatible API

`TextEmbedding`
: creating embeddings from text following the OpenAI-compatible API
To accomplish this, it is possible to use one of the KubeAI-supported engines, namely vLLM, Ollama, and Infinity. In the case of vLLM, adapters are also supported.
For details about the specification, see the corresponding section of the Modelserve reference.
Exposing Text Generation Models
To expose a text generation model, it is possible to use the Core UI or the Python SDK. To define the corresponding function, the following parameters should be specified:
- model name
- inference engine to use (one of `VLLM`, `OLlama`, `Infinity`)
- model URL. Currently the model can be loaded from HuggingFace (`hf://` prefix), from the S3 storage of the platform (`s3://` prefix), or as an Ollama-compatible model (`ollama://` prefix, in case of the OLlama engine)
- feature, which should be set to `TextGeneration`
- in case of the vLLM engine, it is also possible to add a list of adapters for the main model; each adapter is specified with its own name and the URL of the corresponding adapter
To serve the text generation model, the function should be run with the `serve` action, specifying additional parameters. In particular, it may be necessary to specify the HW profile to use with the number of processors (since a GPU may be required) or a resource specification, as well as further parameters and arguments accepted by the KubeAI model specification:
`args`
: command-line arguments to pass to the engine

`env`
: custom environment values (key-value pairs)

`secrets`
: project secrets whose values should be passed to the deployment. For example, if a HuggingFace token is needed, create an `HF_TOKEN` secret in the project with the token value to use.

`files`
: extra file specifications for the deployment

`scaling`
: scaling specification as per the KubeAI documentation

`caching_profile`
: cache profile as per the KubeAI documentation
For example, to deploy a model with an adapter from HuggingFace, the following procedure may be used:
```python
llm_function = project.new_function(
    "llm",
    kind="kubeai-text",
    model_name="tinyllama-chat",
    url="hf://TinyLlama/TinyLlama-1.1B-Chat-v0.3",
    engine="VLLM",
    features=["TextGeneration"],
    adapters=[{"name": "colorist", "url": "hf://jashing/tinyllama-colorist-lora"}],
)

llm_run = llm_function.run(
    action="serve",
    profile="1xa100",
    args=["--enable-prefix-caching", "--max-model-len=8192"],
)
```
Once deployed, the model is available and it is possible to call the OpenAI-compatible API from within the platform. The run status (see the `openai` and `service` sections) contains the name of the model and the endpoints of the exposed KubeAI API.
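To retrieve these values programmatically, the run entity can be inspected. The following is a minimal sketch, assuming the status is read as a dictionary via `to_dict()` (the exact layout of the `openai` and `service` sections is not reproduced here):

```python
# Hedged sketch: reading the served model name and endpoints from the run status.
llm_run.refresh()                    # reload the run to pick up the latest status
status = llm_run.to_dict()["status"]
print(status.get("openai"))          # model name exposed via the OpenAI-compatible API
print(status.get("service"))         # endpoints of the KubeAI service
```

With the model name and the endpoint at hand, the completions API can be called directly, for example with `requests`: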
```python
import requests

KUBEAI_ENDPOINT = "kubeai:80"  # default in-cluster endpoint of the KubeAI API
model_name = "tinyllama-chat-123xyz_colorist"

payload = {"prompt": "Hi", "model": model_name}
res = requests.post(f"http://{KUBEAI_ENDPOINT}/openai/v1/completions", json=payload)
print(res.json())
```
By default, the `KUBEAI_ENDPOINT` is `kubeai:80`.
Model name

Please note how the model name is defined: it is composed of the name of the model as specified in the function and a random value. In case of an adapter, the adapter name is appended as well: `<model_name>-<random>_<adapter-name>`.
It is also possible to use the OpenAI client to interact with the model.
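For instance, a minimal sketch using the completions endpoint (the model name and endpoint values are illustrative and follow the conventions described above):

```python
from openai import OpenAI

KUBEAI_ENDPOINT = "kubeai:80"                  # default in-cluster endpoint
model_name = "tinyllama-chat-123xyz_colorist"  # <model_name>-<random>_<adapter-name>

# api_key is required by the client, but its value is not checked by KubeAI
client = OpenAI(api_key="ignored", base_url=f"http://{KUBEAI_ENDPOINT}/openai/v1")
completion = client.completions.create(model=model_name, prompt="Hi", max_tokens=64)
print(completion.choices[0].text)
```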
Exposing Text Embedding Models
To expose a text embedding model, it is possible to use the Core UI or the Python SDK. To define the corresponding function, the following parameters should be specified:
- model name
- inference engine to use (one of `VLLM` or `Infinity`)
- model URL. Currently the model can be loaded either from HuggingFace (`hf://` prefix) or from the S3 storage of the platform (`s3://` prefix)
- feature, which should be set to `TextEmbedding`
To serve the text embedding model, the function should be run with the `serve` action, specifying additional parameters. Normally, embedding models do not require extra resources; however, further parameters and arguments accepted by the KubeAI model specification may be added:
`args`
: command-line arguments to pass to the engine

`env`
: custom environment values (key-value pairs)

`secrets`
: project secrets whose values should be passed to the deployment

`files`
: extra file specifications for the deployment

`scaling`
: scaling specification as per the KubeAI documentation

`caching_profile`
: cache profile as per the KubeAI documentation
For example, to deploy a model from HuggingFace, the following procedure may be used:
```python
llm_function = project.new_function(
    "llm",
    kind="kubeai-text",
    model_name="embedding",
    url="hf://BAAI/bge-small-en-v1.5",
    engine="Infinity",
    features=["TextEmbedding"],
)

llm_run = llm_function.run(action="serve")
```
Once deployed, the model is available and it is possible to call the OpenAI-compatible API from within the platform, for example with the OpenAI client:
```python
from openai import OpenAI

KUBEAI_ENDPOINT = "kubeai:80"  # default in-cluster endpoint of the KubeAI API
model_name = "embedding-123qwe"

# api_key is required by the client, but its value is not checked by KubeAI
client = OpenAI(api_key="ignored", base_url=f"http://{KUBEAI_ENDPOINT}/openai/v1")
response = client.embeddings.create(
    input="Your text goes here.",
    model=model_name,
)
```
By default, the `KUBEAI_ENDPOINT` is `kubeai:80`.
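The response follows the OpenAI embeddings format, so the vector can be read directly from it:

```python
# Each item in response.data carries one embedding vector.
vector = response.data[0].embedding
print(len(vector))  # dimensionality of the embedding
```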