Managing LLM Models

The platform makes it possible to create and serve HuggingFace-compatible LLM models. Specifically, you can serve LLM models directly from the HuggingFace catalogue given the id of the model, or serve a fine-tuned model from a specified path, such as an S3 location.

The LLM implementation relies on the KServe LLM runtime and therefore supports the following LLM tasks:

  • Text Generation
  • Text2Text Generation
  • Fill Mask
  • Text (Sequence) Classification
  • Token Classification

The API of the exposed service depends on the type of task. Generative models (text generation and text2text generation) use OpenAI's Completion and Chat Completion APIs.

The other task types (token classification, sequence classification, fill mask) are served using KServe's Open Inference Protocol v2 API.

Exposing Predefined Text Classification Models

In case of a predefined non-generative HuggingFace model, it is possible to use the huggingfaceserve runtime to expose the corresponding inference API. For this purpose, define a huggingfaceserve function (via UI or SDK), providing the name of the exposed model and the URI of the model in the following form:

huggingface://<id of the huggingface model>

For example, huggingface://distilbert/distilbert-base-uncased-finetuned-sst-2-english.

When using the SDK, this may be accomplished as follows.

First, import the necessary libraries:

import digitalhub as dh
import pandas as pd
import requests
import os

Create a project to host the functions and executions

PROJECT = "llm"
project = dh.get_or_create_project(PROJECT)

Create the serving function definition:

llm_function = project.new_function("llm_classification", 
                                   kind="huggingfaceserve",
                                   model_name="mymodel",
                                   path="huggingface://distilbert/distilbert-base-uncased-finetuned-sst-2-english"
                                  )

Serve the model:

llm_run = llm_function.run(action="serve", profile="template-a100")

Please note the use of the profile parameter. Since LLM models require specific hardware (GPUs in particular), it is necessary to specify the hardware requirements as described in the Configuring Kubernetes executions section. In particular, it is possible to rely on the predefined resource templates of the platform deployment.
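
Once the run is started, the service URL becomes available in the run status. Before sending inference requests, one way to verify that the server is up is the Open Inference Protocol health endpoint (a minimal sketch, assuming the standard v2 readiness route is exposed by the runtime):

SERVICE_URL = llm_run.refresh().status.to_dict()["service"]["url"]

# Optional readiness check: the Open Inference Protocol defines
# GET /v2/health/ready, which returns 200 once the model server is ready.
r = requests.get(f"http://{SERVICE_URL}/v2/health/ready")
print(r.status_code)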

Once the service becomes available, it is possible to make the calls:

SERVICE_URL = llm_run.refresh().status.to_dict()["service"]["url"]
MODEL_NAME = "mymodel"

with requests.post(f'http://{SERVICE_URL}/v2/models/{MODEL_NAME}/infer', json={
    "inputs": [
        {
        "name": "input-0",
        "shape": [2],
        "datatype": "BYTES",
        "data": ["Hello, my dog is cute", "I am feeling sad"]
        }
    ]
}) as r:
    res = r.json()
print(res)

Here the classification service API follows the Open Inference Protocol, and the expected result should have the following form:

{
    'model_name': 'mymodel',
    'model_version': None,
    'id': 'cab30aa5-c10f-4233-94e2-14e4bc8fbf6f',
    'parameters': None,
    'outputs': [
        {'name': 'output-0', 'shape': [2], 'datatype': 'INT64', 'parameters': None, 'data': [1, 0]}
    ]
}
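
The returned data values are raw class indices. For distilbert-base-uncased-finetuned-sst-2-english the standard label mapping is 0 = NEGATIVE and 1 = POSITIVE (an assumption to verify against the served model's configuration), so a small illustrative post-processing step could look as follows:

# Hypothetical post-processing: map the returned class indices to labels.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

texts = ["Hello, my dog is cute", "I am feeling sad"]
predictions = res["outputs"][0]["data"]

for text, pred in zip(texts, predictions):
    print(f"{text!r} -> {id2label[pred]}")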

As with other services (ML model services or Serverless functions), it is possible to expose the service using the KRM API gateway functionality.

Exposing Predefined Text Generation Models

In case of a predefined generative HuggingFace model, it is possible to use the huggingfaceserve runtime to expose an OpenAI-compatible API. For this purpose, define a huggingfaceserve function (via UI or SDK), providing the name of the exposed model and the URI of the model in the following form:

huggingface://<id of the huggingface model>

For example, huggingface://meta-llama/meta-llama-3-8b-instruct.

When using the SDK, this may be accomplished as follows.

First, import the necessary libraries:

import digitalhub as dh
import pandas as pd
import requests
import os

Create a project to host the functions and executions

PROJECT = "llm"
project = dh.get_or_create_project(PROJECT)

Create the serving function definition:

llm_function = project.new_function("llm_generation", 
                                   kind="huggingfaceserve",
                                   model_name="mymodel",
                                   path="huggingface://meta-llama/meta-llama-3-8b-instruct"
                                  )

Serve the model:

llm_run = llm_function.run(action="serve", profile="template-a100")

Please note that in case of protected (gated) models, such as the Llama models, it is necessary to pass the HuggingFace token. For example:

llm_run = llm_function.run(action="serve", 
                           profile="template-a100", 
                           envs = [{
                                "name": "HF_TOKEN",
                                "value": "<HUGGINGFACE TOKEN>"
                            }]
                          )

As in the case of classification models, generative LLM models require specific hardware (GPUs in particular), so it is necessary to specify the hardware requirements as described in the Configuring Kubernetes executions section. In particular, it is possible to rely on the predefined resource templates of the platform deployment.

Once the service becomes available, it is possible to make the calls. For example, for completion requests:

SERVICE_URL = llm_run.refresh().status.to_dict()["service"]["url"]
MODEL_NAME = "mymodel"

with requests.post(f'http://{SERVICE_URL}/openai/v1/completions', json={
    "model": MODEL_NAME, "prompt": "Hello! How are you?", "stream": False, "max_tokens": 30
    }) as r:
    res = r.json()
print(res)

Here the expected output should have the following form:

{
  "id": "cmpl-625a9240f25e463487a9b6c53cbed080",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": " and how they make you feel\nColors, oh colors, so vibrant and bright\nA world of emotions, a kaleidoscope in sight\nRed"
    }
  ],
  "created": 1718620153,
  "model": "mymodel",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 6,
    "total_tokens": 36
  }
}

In case of chat requests:

with requests.post(f'http://{SERVICE_URL}/openai/v1/chat/completions', json={
    "model": MODEL_NAME, 
    "messages":[
        {"role":"system","content":"You are an assistant that speaks like Shakespeare."},
        {"role":"user","content":"Write a poem about colors"}
    ], 
    "max_tokens":30,
    "stream": False}) as r:
    res = r.json()
print(res)

Expected output:

 {
   "id": "cmpl-9aad539128294069bf1e406a5cba03d3",
   "choices": [
     {
       "finish_reason": "length",
       "index": 0,
       "message": {
         "content": "  O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
         "tool_calls": null,
         "role": "assistant",
         "function_call": null
       },
       "logprobs": null
     }
   ],
   "created": 1718638005,
   "model": "mymodel",
   "system_fingerprint": null,
   "object": "chat.completion",
   "usage": {
     "completion_tokens": 30,
     "prompt_tokens": 37,
     "total_tokens": 67
   }
 }    
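
Both endpoints also support streaming when "stream" is set to True: the response is then returned incrementally as server-sent events. A minimal sketch of consuming such a stream (assuming the standard OpenAI-style "data: ..." / "[DONE]" framing used by the underlying runtime):

import json

# Illustrative streaming request: each event line is prefixed with "data: "
# and the stream is terminated by a "[DONE]" marker.
with requests.post(f'http://{SERVICE_URL}/openai/v1/completions', json={
    "model": MODEL_NAME, "prompt": "Hello! How are you?", "stream": True, "max_tokens": 30
    }, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        payload = line.decode("utf-8").removeprefix("data: ")
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        print(chunk["choices"][0]["text"], end="", flush=True)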

As with other services (ML model services or Serverless functions), it is possible to expose the service using the KRM API gateway functionality.

Serving Fine-Tuned LLM Models

When it comes to custom LLM models, it is possible to create a HuggingFace-based fine-tuned model, log it to the platform, and then serve it from the model path.

When using the SDK, this may be accomplished as follows.

First, import the necessary libraries:

import digitalhub as dh
import pandas as pd
import requests
import os

Create a project to host the functions and executions

PROJECT = "llm"
project = dh.get_or_create_project(PROJECT)

Create the training procedure that logs the model to the platform:

%%writefile "src/train_model.py"

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
import numpy as np
import evaluate
import os

from digitalhub_runtime_python import handler

@handler()
def train(project):
    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
    metric = evaluate.load("accuracy")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)


    dataset = load_dataset("yelp_review_full")
    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

    # Use small subsets of the dataset to keep the example training fast
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

    training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()

    # Save the trained model and tokenizer locally before logging them
    save_dir = "model"
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    trainer.save_model(save_dir)
    tokenizer.save_pretrained(save_dir)

    project.log_model(
            name="test_llm_model", 
            kind="huggingface", 
            base_model="google-bert/bert-base-cased",
            source=save_dir
    )    

Register the function and execute it:

train_func = project.new_function(
                         name="train_model",
                         kind="python",
                         python_version="PYTHON3_9",
                         code_src="src/train_model.py",
                         handler="train",
                         requirements=["evaluate", "transformers[torch]", "torch", "torchvision", "accelerate"]
                        )

train_run = train_func.run(action="job", local_execution=False, profile="template-a100")
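
Once the training run completes, the logged model can be looked up in the project to find out where it was stored. A hedged sketch (assuming the model entity exposes its storage location via spec.path; otherwise the S3 path can be copied from the model details in the UI):

# Illustrative only: retrieve the logged model and read its storage path.
# The spec.path attribute is an assumption; adjust if the SDK exposes the
# location differently.
model = project.get_model("test_llm_model")
MODEL_PATH = model.spec.path
print(MODEL_PATH)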

Create the serving function definition:

llm_function = project.new_function("llm_classification", 
                                   kind="huggingfaceserve",
                                   model_name="mymodel",
                                   path="s3://datalake/llm/model/test_llm_model/f8026820-2471-4497-97f5-8e6d49baac5f/"
                                  )

Serve the model:

llm_run = llm_function.run(action="serve", profile="template-a100")

Please note the use of the profile parameter. Since LLM models require specific hardware (GPUs in particular), it is necessary to specify the hardware requirements as described in the Configuring Kubernetes executions section. In particular, it is possible to rely on the predefined resource templates of the platform deployment.

Once the service becomes available, it is possible to make the calls:

SERVICE_URL = llm_run.refresh().status.to_dict()["service"]["url"]
MODEL_NAME = "mymodel"

with requests.post(f'http://{SERVICE_URL}/v2/models/{MODEL_NAME}/infer', json={
    "inputs": [
        {
        "name": "input-0",
        "shape": [2],
        "datatype": "BYTES",
        "data": ["Hello, my dog is cute", "I am feeling sad"]
        }
    ]
}) as r:
    res = r.json()
print(res)

Here the classification service API follows the Open Inference Protocol, and the expected result should have the following form:

{
    'model_name': 'mymodel',
    'model_version': None,
    'id': 'cab30aa5-c10f-4233-94e2-14e4bc8fbf6f',
    'parameters': None,
    'outputs': [
        {'name': 'output-0', 'shape': [2], 'datatype': 'INT64', 'parameters': None, 'data': [4, 0]}
    ]
}
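
Since the model was fine-tuned on yelp_review_full with num_labels=5, the returned class indices 0-4 correspond to star ratings 1-5. An illustrative post-processing step:

# Hypothetical post-processing: yelp_review_full labels 0-4 map to 1-5 stars.
texts = ["Hello, my dog is cute", "I am feeling sad"]
predictions = res["outputs"][0]["data"]

for text, pred in zip(texts, predictions):
    print(f"{text!r} -> {pred + 1} stars")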

As with other services (ML model services or Serverless functions), it is possible to expose the service using the KRM API gateway functionality.