Exposing Open Inference v2 Services

The Open Inference protocol v2 is a protocol used by various inference providers (e.g., NVIDIA Triton, KServe) to expose models that can be used for ML inference. It defines standard interfaces, defined by input and output tensor definitions, to define how the interaction should be performed in a platform-agnostic manner.

In this tutorial, we demonstrate how to deploy a Visual Question Answering (VQA) service using the BLIP model from Salesforce and serve it via the OpenInference runtime on the platform.

The workflow covers:

Setting up the project and source directory
Writing the inference service code
Registering and deploying the function
Sending inference requests to the running service

1. Initialize the DigitalHub Project

Import the digitalhub SDK and create (or retrieve) a project named demo. The project acts as a namespace for all functions, artifacts, and runs in DigitalHub.

import digitalhub as dh

project = dh.get_or_create_project("demo")

2. Write the Inference Service Code

Prepare the source directory

from pathlib import Path

Path("src").mkdir(exist_ok=True)

Use the %%writefile magic to write src/openinference_service.py to disk. This file contains:

VQAModel — a wrapper around the Salesforce/blip-image-captioning-base model that loads the BLIP processor and model onto the available device (GPU or CPU) and exposes a generate_caption method for both image captioning and visual question answering.
init_context(context) — called once at server startup; loads the VQAModel and attaches it to the server context.

handler(context, request) — the per-request inference function; decodes the incoming raw image bytes, optionally reads a question parameter, calls generate_caption, and returns the result as a BYTES output tensor.

%%writefile "src/openinference_service.py"

"""
Visual Question Answering Service using BLIP model and PyTriton.

This microservice serves a BLIP model for visual question answering tasks.
It accepts images (JPEG/PNG) and returns text descriptions.
"""

import base64

from io import BytesIO
from PIL import Image
import json

# Import transformers for BLIP model
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch

# Configure logging
import time


class VQAModel:
    """Visual Question Answering model wrapper for BLIP."""

    def __init__(self, model_name: str = "Salesforce/blip-image-captioning-base"):
        """
        Initialize the BLIP model and processor.

        Args:
            model_name: HuggingFace model identifier
        """
        print(f"Loading BLIP model: {model_name}")
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")

        self.processor = BlipProcessor.from_pretrained(model_name)
        self.model = BlipForConditionalGeneration.from_pretrained(model_name).to(self.device)
        print("Model loaded successfully")

    def generate_caption(self, image: Image.Image, question: str = None) -> str:
        """
        Generate a caption or answer for the given image.

        Args:
            image: PIL Image
            question: Optional question text for VQA

        Returns:
            Generated text description
        """
        if question:
            # Visual question answering
            inputs = self.processor(image, question, return_tensors="pt").to(self.device)
        else:
            # Image captioning
            inputs = self.processor(image, return_tensors="pt").to(self.device)

        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=50)

        caption = self.processor.decode(output[0], skip_special_tokens=True)
        return caption


def init_context(context):
    """Initialize the VQA model at server startup."""
    vqa_model = VQAModel()
    setattr(context, "model", vqa_model)

def handler(context, request):
    """
    Inference function for the VQA model.

    Args:
        request: Inference Request object

    Returns:
        List inference response
    """
    vqa_model = getattr(context, 'model', None)
    if vqa_model is None:
        init_context(context)
        vqa_model = getattr(context, 'model', None)

    vqa_model = context.model
    image_bytes = request.inputs[0].data

    caption = ""
    try:
        # Convert bytes to PIL Image
        image_bytes_clean = bytes(image_bytes)
        image = Image.open(BytesIO(image_bytes_clean)).convert('RGB')
        # time.sleep(2)  # Simulate processing time
        # Generate caption

        if request.parameters and "question" in request.parameters:
            question = request.parameters["question"]
            caption = vqa_model.generate_caption(image, question)
        else:
            caption = vqa_model.generate_caption(image)

        context.logger.info(f"Generated caption: {caption}")

    except Exception as e:
        context.logger.error(f"Error processing image: {e}")
        # caption = f"Error: {str(e)}"


    # Convert results to numpy array with object dtype for variable-length strings
    return {
        "outputs": 
            [
                {"name": "caption", "datatype": "BYTES", "data": [caption], "shape": [1, len(caption)]}
            ]

    }

3. Register the Function in DigitalHub

Create a new OpenInference function in the project that points to the service file written above. Key parameters:

Parameter	Value	Description
`kind`	`openinference`	Runtime type used by DigitalHub
`handler`	`handler`	Per-request inference entry point
`init_function`	`init_context`	Called once at startup to load the model
`model_name`	`vqa_blip`	Model name exposed by the inference server
`inputs`	`image (UINT8)`	Raw image bytes accepted by the model
`outputs`	`caption (BYTES)`	Text caption/answer returned by the model
`requirements`	torch, transformers, pillow	Python packages installed in the serving environment

func = project.new_function(name="vqa-oi",
                            kind="openinference",
                            python_version="PYTHON3_10",
                            code_src="src/openinference_service.py",
                            handler="handler",
                            init_function="init_context",
                            model_name="vqa_blip",
                            inputs=[{
                                "name": "image",
                                "datatype": "UINT8",
                                "shape": [-1,-1]
                            }],
                            outputs=[{
                                "name": "caption",
                                "datatype": "BYTES",
                                "shape": [-1,-1]
                            }],
                            requirements=["torch>=2.0.0", "transformers>=4.30.0", "pillow>=10.0.0"]
                           )

4. Deploy the Function as a Serving Endpoint

Run the registered function with the serve action to start an inference server. The resource requests ensure the BLIP model weights can be downloaded and held in memory.

Note: This cell is kept as a raw cell to prevent accidental re-deployment. Convert it to a code cell and run it when you want to (re)deploy the service.

run = func.run(action="serve", resources={"mem": "4Gi", "disk": "30Gi"})

5. Prepare an Inference Request

Configure the client with the service URL and model name, then download a sample parrot image from HuggingFace. The image bytes are wrapped into a KServe v2 inference request payload with a single UINT8 input tensor.

import requests
import urllib.request

BASE_URL = run.refresh().status.service['url']
MODEL_NAME = 'vqa_blip'

img_url = 'https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png'
with urllib.request.urlopen(img_url) as response:
    image_bytes = response.read()

    request = {
        "inputs": [
                {
                    "name": "input",
                    "datatype": "UINT8",
                    "shape": [1, len(image_bytes)],
                    "data": list(image_bytes)
                }
            ]    
    }

6. Send the Inference Request

POST the prepared payload to the KServe v2 inference endpoint (/v2/models/{model_name}/infer) and print the HTTP status code to confirm the request was received successfully.

response = requests.post(
        f"http://{BASE_URL}/v2/models/{MODEL_NAME}/infer",
        json=request,
        headers={"Content-Type": "application/json"}
    )

print(f"Status Code: {response.status_code}")
data = response.json()

7. Inspect the Response

Display the full JSON response returned by the inference server. The response follows the Open Inference v2 format and includes an outputs array with the model's generated caption for the image.

print(data)