LLM

The first step is to deploy and serve a pre-trained Large Language Model. We'll work with the Llama 3.2 model for text generation.

Project initialization

Initialize a project on the platform:

import digitalhub as dh
import getpass as gt

USERNAME = gt.getuser()

project = dh.get_or_create_project(f"{USERNAME}-tutorial-project")
print(project.name)

Model configuration

We'll create a function to serve the Llama 3.2 model directly. The model URL may use different protocols, such as ollama:// or hf://, to reference models from the corresponding hub without downloading them manually.

llm_function = project.new_function(
    name="llama32-1b",
    kind="kubeai-text",
    model_name=f"{USERNAME}-model",
    url="ollama://llama3.2:1b",
    engine="OLlama",
    features=["TextGeneration"]
)
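
As an illustration of the hf:// protocol mentioned above, a Hugging Face-hosted model could be configured in a similar way. This is a hypothetical sketch: the model URL, the engine value, and whether your platform has a vLLM engine enabled are assumptions, so adapt it before use.

# Hypothetical example only: serve a Hugging Face-hosted model.
# The hf:// URL and the "VLLM" engine value are assumptions about the platform setup.
hf_function = project.new_function(
    name="qwen25-05b-instruct",
    kind="kubeai-text",
    model_name=f"{USERNAME}-hf-model",
    url="hf://Qwen/Qwen2.5-0.5B-Instruct",
    engine="VLLM",
    features=["TextGeneration"]
)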

Model serving

To deploy the model, we run the function's serve task with a GPU profile to accelerate generation.

llm_run = llm_function.run("serve", profile="1xa100-80GB", wait=True)

Let's check that our service is running and ready to accept requests:

service = llm_run.refresh().status.service
print("Service status:", service)

Even when the service is ready, the model itself still needs to be downloaded and deployed. We can check its status:

status = llm_run.refresh().status.k8s.get("Model")['status']
print("Model status:", status)

Once the model is ready, we save the service URL and the model name:

CHAT_URL = llm_run.status.to_dict()["service"]["url"]
CHAT_MODEL = llm_run.status.to_dict()["openai"]["model"]
print(f"service {CHAT_URL} with model {CHAT_MODEL}")

Test the LLM API

Let's test our deployed model with a prompt:

import pprint

model_name = llm_run.refresh().status.k8s.get("Model").get("metadata").get("name")
json_payload = {"model": model_name, "prompt": "Describe MLOps"}

result = llm_run.invoke(
    model_name=model_name,
    json=json_payload,
    url=service["url"] + "/v1/completions",
).json()

pp = pprint.PrettyPrinter(indent=2)
print("Response:")
pp.pprint(result)

The response contains the generated text, as well as token-usage statistics.
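
To work with the output programmatically, the generated text and the usage counters can be extracted from the result. A minimal sketch, assuming the standard OpenAI completions response layout:

# Pull out the completion text and token usage (assumes OpenAI-style fields).
answer = result["choices"][0]["text"]
usage = result.get("usage", {})

print("Answer:", answer.strip())
print("Total tokens:", usage.get("total_tokens"))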