Managing LLM Models
The platform makes it possible to create and serve HuggingFace-compatible LLM models. Specifically, you can serve an LLM model directly from the HuggingFace catalogue given its model id, or serve a fine-tuned model from a specified path, such as S3.
The LLM implementation relies on the KServe LLM runtime and therefore supports the following LLM tasks:
- Text Generation
- Text2Text Generation
- Fill Mask
- Text (Sequence) Classification
- Token Classification
The API of the exposed service differs depending on the task type. Generative models (text generation and text2text generation) use OpenAI's Completion and Chat Completion APIs.
The other task types (token classification, sequence classification, fill mask) are served using KServe's Open Inference Protocol v2 API.
Exposing Predefined Text Classification Models
For a predefined non-generative HuggingFace model, you can use the huggingfaceserve runtime to expose the corresponding inference API. To do so, define the huggingfaceserve function (via UI or SDK), providing the name of the exposed model and the URI of the model in the following form:
huggingface://<id of the huggingface model>
For example, huggingface://distilbert/distilbert-base-uncased-finetuned-sst-2-english.
When using the SDK, this may be accomplished as follows.
First, import the necessary libraries and create a project to host the functions and executions:
import digitalhub as dh
project = dh.get_or_create_project("llm")
Create the serving function definition:
llm_function = project.new_function("llm-classification",
kind="huggingfaceserve",
model_name="mymodel",
path="huggingface://distilbert/distilbert-base-uncased-finetuned-sst-2-english")
Serve the model:
llm_run = llm_function.run(action="serve", profile="1xa100", wait=True)
Please note the use of the profile parameter. Since LLM models require specific hardware (in particular, GPUs), it is necessary to specify the hardware requirements as described in the Configuring Kubernetes executions section. In particular, you can rely on the predefined resource templates of the platform deployment.
As in other scenarios, you need to wait for the service to become available. Once it is, you can make calls to it:
json = {
"inputs": [
{
"name": "input-0",
"shape": [2],
"datatype": "BYTES",
"data": ["Hello, my dog is cute", "I am feeling sad"],
},
]
}
llm_run.invoke(model_name="mymodel", json=json).json()
The classification LLM service API follows the Open Inference Protocol, and the expected result has the following form:
{
"model_name": "mymodel",
"model_version": None,
"id": "cab30aa5-c10f-4233-94e2-14e4bc8fbf6f",
"parameters": None,
"outputs": [
{
"name": "output-0",
"shape": [2],
"datatype": "INT64",
"parameters": None,
"data": [1, 0],
},
],
}
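For this particular model (distilbert-base-uncased-finetuned-sst-2-english), the returned indices are sentiment classes (0 = NEGATIVE, 1 = POSITIVE), so you can map the raw predictions back to labels; a minimal sketch:
# Map the Open Inference Protocol output to sentiment labels.
# distilbert-base-uncased-finetuned-sst-2-english uses 0 = NEGATIVE, 1 = POSITIVE.
labels = {0: "NEGATIVE", 1: "POSITIVE"}
response = llm_run.invoke(model_name="mymodel", json=json).json()
predictions = [labels[i] for i in response["outputs"][0]["data"]]
print(predictions)  # e.g. ['POSITIVE', 'NEGATIVE']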
As with other services (ML model services or Serverless functions), it is possible to expose the service using the KRM API gateway functionality.
Exposing Predefined Text Generation Models
For a predefined generative HuggingFace model, you can use the huggingfaceserve runtime to expose an OpenAI-compatible API. To do so, define the huggingfaceserve function (via UI or SDK), providing the name of the exposed model and the URI of the model in the following form:
huggingface://<id of the huggingface model>
For example, huggingface://meta-llama/meta-llama-3-8b-instruct.
When using the SDK, this may be accomplished as follows.
First, import the necessary libraries and create a project to host the functions and runs:
import digitalhub as dh
project = dh.get_or_create_project("llm")
Create the serving function definition:
llm_function = project.new_function("llm-generation",
kind="huggingfaceserve",
model_name="mymodel",
path="huggingface://meta-llama/meta-llama-3-8b-instruct")
Next, we serve the model. This particular one is protected, so you need to provide a HuggingFace token with access to it. As the model is large, we use a profile with more resources.
hf_token = "<HUGGINGFACE TOKEN>"
llm_run = llm_function.run(action="serve",
profile="1xa100-80GB",
envs = [{"name": "HF_TOKEN", "value": hf_token}],
wait=True)
Deployment time
Mind that when requesting a GPU node for the service, it may take some time for the service to start, in some cases up to 10 minutes.
As with classification models, LLM models require specific hardware (in particular, GPUs), so it is necessary to specify the hardware requirements as described in the Configuring Kubernetes executions section. In particular, it is possible to rely on the predefined resource templates of the platform deployment.
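If you prefer not to block with wait=True, you can poll the run status yourself until the service endpoint is published; a minimal sketch, assuming (as in the call below) that the run status exposes a service entry once the deployment is up:
import time

# Poll the run until the service endpoint appears in its status.
while not llm_run.refresh().status.to_dict().get("service"):
    time.sleep(30)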
Once the service becomes available, you can make calls to it. For example, for completion requests:
service_url = llm_run.refresh().status.to_dict()["service"]["url"]
url = f"http://{service_url}/openai/v1/completions"
json = {
"model": "mymodel",
"prompt": "Hello! How are you?",
"stream": False,
"max_tokens": 30
}
llm_run.invoke(url=url, json=json).json()
Here the expected output should have the following form:
{
"id":"cmpl-69dd8b1ea70c477fbf80c353ac73b52e",
"choices":[
{
"finish_reason":"length",
"index":0,
"logprobs":"None",
"text":" Hope you're having a great day!\n\nHere I'd like to share some news about my new podcast, where I'll be exploring the world of..."
}
],
"created":1761210462,
"model":"mymodel",
"system_fingerprint":"None",
"object":"text_completion",
"usage":{
"completion_tokens":30,
"prompt_tokens":7,
"total_tokens":37
}
}
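Since the endpoint is OpenAI-compatible, you can also call it with any HTTP client instead of run.invoke; for example, a plain requests call (a sketch, assuming the service URL is reachable from where the code runs and requires no authentication):
import requests

# Call the OpenAI-compatible completions endpoint directly.
service_url = llm_run.refresh().status.to_dict()["service"]["url"]
response = requests.post(
    f"http://{service_url}/openai/v1/completions",
    json={"model": "mymodel", "prompt": "Hello! How are you?", "max_tokens": 30, "stream": False},
)
print(response.json()["choices"][0]["text"])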
For chat requests:
service_url = llm_run.refresh().status.to_dict()["service"]["url"]
url = f'http://{service_url}/openai/v1/chat/completions'
json = {
"model": "mymodel",
"messages": [
{"role": "system", "content": "You are an assistant that speaks like Shakespeare."},
{"role": "user", "content": "Write a poem about colors"}
],
"max_tokens": 30,
"stream": False
}
llm_run.invoke(url=url, json=json).json()
Expected output:
{
"id": "cmpl-9aad539128294069bf1e406a5cba03d3",
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": " O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1718638005,
"model": "mymodel",
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"completion_tokens": 30,
"prompt_tokens": 37,
"total_tokens": 67
}
}
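Since the API is OpenAI-compatible, you can also use the official openai Python client against the exposed endpoint. This is not part of the platform SDK; the sketch below assumes the openai package is installed and that the service accepts a dummy API key:
from openai import OpenAI

# Point the OpenAI client at the OpenAI-compatible endpoint of the service.
service_url = llm_run.refresh().status.to_dict()["service"]["url"]
client = OpenAI(base_url=f"http://{service_url}/openai/v1", api_key="unused")

completion = client.chat.completions.create(
    model="mymodel",
    messages=[
        {"role": "system", "content": "You are an assistant that speaks like Shakespeare."},
        {"role": "user", "content": "Write a poem about colors"},
    ],
    max_tokens=30,
)
print(completion.choices[0].message.content)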
As with other services (ML model services or Serverless functions), it is possible to expose the service using the KRM API gateway functionality.
Serving a Fine-Tuned LLM Model
For a custom LLM, it is possible to create a HuggingFace-based fine-tuned model, log it to the platform, and serve it from the model path.
When using the SDK, this may be accomplished as follows.
First, import the necessary libraries and create a project to host the functions and runs:
import digitalhub as dh
project = dh.get_or_create_project("llm")
Create a directory for the code:
from pathlib import Path
Path("src").mkdir(exist_ok=True)
Create the training procedure that logs the model to the platform:
%%writefile "src/train_model.py"
import os
import evaluate
import numpy as np
from datasets import load_dataset
from digitalhub_runtime_python import handler
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
@handler()
def train(project):
    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
    metric = evaluate.load("accuracy")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Load and tokenize the dataset
    dataset = load_dataset("yelp_review_full")
    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Fine-tune the pretrained model on a small subset of the data
    model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
    training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()

    # Save the fine-tuned model and tokenizer, then log them to the platform
    save_dir = "model"
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    trainer.save_model(save_dir)
    tokenizer.save_pretrained(save_dir)

    project.log_model(
        name="test_llm_model",
        kind="huggingface",
        base_model="google-bert/bert-base-cased",
        source=save_dir,
    )
Register the function:
train_func = project.new_function(name="train_model",
kind="python",
python_version="PYTHON3_10",
code_src="src/train_model.py",
handler="train",
requirements=["evaluate", "transformers[torch]", "torch", "torchvision", "accelerate"])
Run it:
train_run = train_func.run(action="job", profile="1xa100", wait=True)
A new model should have been created in the project. We need its path:
llm_model_path = project.get_model("test_llm_model").spec.path
Create the serving function definition:
llm_function = project.new_function("llm-classification",
kind="huggingfaceserve",
model_name="mymodel",
path=llm_model_path)
Serve the model:
llm_run = llm_function.run(action="serve", profile="1xa100", volumes=[{
"volume_type": "persistent_volume_claim",
"name": "volume-llmpa",
"mount_path": "/shared",
"spec": { "size": "10Gi" }}]
)
Please note the use of the profile parameter. Since LLM models require specific hardware (in particular, GPUs), it is necessary to specify the hardware requirements as described in the Configuring Kubernetes executions section. In particular, it is possible to rely on the predefined resource templates of the platform deployment. Also, for large models the default disk space may be insufficient, so an extra volume should be configured for the underlying deployment, as shown above.
Once the service becomes available, it is possible to make the calls:
json = {
"inputs": [
{
"name": "input-0",
"shape": [2],
"datatype": "BYTES",
"data": ["Hello, my dog is cute", "I am feeling sad"],
}
]
}
llm_run.refresh().invoke(model_name="mymodel", json=json).json()
The classification LLM service API follows the Open Inference Protocol, and the expected result has the following form:
{
"model_name": "mymodel",
"model_version": None,
"id": "cab30aa5-c10f-4233-94e2-14e4bc8fbf6f",
"parameters": None,
"outputs": [
{
"name": "output-0",
"shape": [2],
"datatype": "INT64",
"parameters": None,
"data": [4, 0],
}
],
}
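For this fine-tuned model, the output indices correspond to the yelp_review_full labels (0 to 4, i.e. 1 to 5 stars), so you can map the raw predictions back to star ratings; a minimal sketch:
# Map the Open Inference Protocol output to star ratings.
# yelp_review_full uses labels 0..4, corresponding to 1..5 stars.
response = llm_run.refresh().invoke(model_name="mymodel", json=json).json()
stars = [label + 1 for label in response["outputs"][0]["data"]]
print(stars)  # e.g. [5, 1]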
As with other services (ML model services or Serverless functions), it is possible to expose the service using the KRM API gateway functionality.