Digitalhub Platform: HPC Workload Offloading via InterLink
Overview
The Digitalhub platform leverages Kubernetes-native orchestration and can dynamically offload machine learning workloads to an external HPC (High Performance Computing) cluster using the InterLink Project. This hybrid approach combines the flexibility of Kubernetes with the raw computational power of HPC resources for demanding ML training and inference tasks.
Architecture
The InterLink API and the plugin can be deployed in three different arrangements across the Kubernetes cluster and the remote HPC side. See the InterLink Project documentation for more information.
In this example we use the tunneled deployment scenario, sketched below.
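A simplified view of the tunneled arrangement, using the component names, sockets, and ports configured in the steps below:

```text
Kubernetes cluster                               HPC cluster
+-----------------------------------+            +----------------------------------+
| InterLink virtual node            |            | login node                       |
|   InterLink API                   |            |   interlink-slurm-plugin         |
|   unix:///var/run/interlink.sock  |            |   unix:///var/run/plugin.sock    |
|   SSH bastion (port 31022)        |<---SSH-----|   tunnel forwards the plugin     |
+-----------------------------------+   tunnel   |   socket; jobs go to Slurm       |
                                                 +----------------------------------+
```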
Prerequisites
- Kubernetes cluster (with Digitalhub platform installed)
- Access to an HPC cluster with a job scheduler (Slurm, PBS, etc.)
- InterLink project deployed and configured
- Network connectivity between clusters
- Appropriate authentication credentials
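Before starting, you can sanity-check the network prerequisite from the HPC login node. Port 31022 is the bastion NodePort configured later in this guide; the node address placeholder is yours to fill in:

```bash
# From the HPC login node: verify the Kubernetes-side SSH bastion is reachable
nc -vz <kubernetes-node-address> 31022
```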
Kubernetes Configuration
- SSH Key Setup: Generate an SSH key pair if you don't have one:
```bash
# Generate an SSH key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/interlink_rsa

# Copy the private key to the remote HPC login node (it will open the tunnel)
scp ~/.ssh/interlink_rsa user@remote-server:~

# Test the SSH connection
ssh -i ~/.ssh/interlink_rsa user@remote-server
```
- InterLink Setup: Deploy InterLink on your Kubernetes cluster with a values file like the one below:
```bash
helm install --create-namespace -n interlink virtual-node \
  oci://ghcr.io/intertwin-eu/interlink-helm-chart/interlink \
  --values my-values.yaml
```
```yaml
# my-values.yaml
nodeName: interlink-socket-node

interlink:
  enabled: true
  socket: unix:///var/run/interlink.sock

plugin:
  socket: unix:///var/run/plugin.sock

sshBastion:
  enabled: true
  clientKeys:
    authorizedKeys: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI..." # previously created public key (must match the key type you generated)
  port: 31022

virtualNode:
  resources:
    CPUs: 8
    memGiB: 32
    pods: 100
```
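Once the chart is installed, a couple of quick checks (resource names below follow the values above; adjust for your release):

```bash
# The virtual node should appear in the node list
kubectl get node interlink-socket-node

# The bastion service should expose port 31022
kubectl get svc -n interlink
```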
HPC Configuration
- Download the interlink-slurm-plugin on your login node.
- Configure the InterLink SLURM plugin to listen on a Unix socket instead of a TCP port:
```yaml
# /root/SlurmConfig.yaml
SidecarPort: ""
Socket: "unix:///var/run/plugin.sock"
SbatchPath: "/usr/bin/sbatch"
ScancelPath: "/usr/bin/scancel"
SqueuePath: "/usr/bin/squeue"
SinfoPath: "/usr/bin/sinfo"
CommandPrefix: ""
SingularityPrefix: ""
SingularityPath: "singularity"
ExportPodData: true
DataRootFolder: ".local/interlink/jobs/"
Namespace: "vk"
Tsocks: false
TsocksPath: "$WORK/tsocks-1.8beta5+ds1/libtsocks.so"
TsocksLoginNode: "login01"
BashPath: /bin/bash
VerboseLogging: true
ErrorsOnlyLogging: false
ContainerRuntime: singularity
EnrootDefaultOptions: ["--rw"]
EnrootPrefix: ""
EnrootPath: enroot
```
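Before starting the plugin, it helps to confirm that the scheduler binaries referenced in the config actually exist on this login node, since paths vary between sites:

```bash
# Verify the Slurm commands configured above are present and executable
for cmd in /usr/bin/sbatch /usr/bin/scancel /usr/bin/squeue /usr/bin/sinfo; do
  test -x "$cmd" && echo "OK: $cmd" || echo "MISSING: $cmd"
done

# ContainerRuntime is singularity, so it must resolve on PATH as well
singularity --version
```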
- On the remote HPC login node, start your InterLink plugin:

```bash
# Example: start the SLURM plugin on the remote HPC system
cd /path/to/plugin
SLURMCONFIGPATH=/root/SlurmConfig.yaml SHARED_FS=true /path/to/plugin/slurm-sidecar
```
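The plugin creates its Unix socket on startup. A quick check, plus one possible way (nohup; a systemd unit would also work) to keep it alive after you log out:

```bash
# The socket path from SlurmConfig.yaml should now exist
ls -l /var/run/plugin.sock

# Optionally keep the plugin running across logouts
nohup env SLURMCONFIGPATH=/root/SlurmConfig.yaml SHARED_FS=true \
  /path/to/plugin/slurm-sidecar > plugin.log 2>&1 &
```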
- Forward the SLURM plugin's Unix socket to the SSH bastion on the Kubernetes side, so the in-cluster InterLink API can reach it:

```bash
# Remote-forward (-R) the plugin socket to the bastion (NodePort 31022),
# authenticating with the key generated earlier
ssh -nNT -i ~/.ssh/interlink_rsa -p 31022 \
  -R /var/run/plugin.sock:/var/run/plugin.sock user@sshbastiononkubernetes
```
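This tunnel is the single link between the two sides, so it's worth supervising. If autossh is available on the login node (an assumption; install it if not), it can re-establish the forward automatically:

```bash
# Same remote forward, supervised by autossh with SSH keep-alives
autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
  -nNT -i ~/.ssh/interlink_rsa -p 31022 \
  -R /var/run/plugin.sock:/var/run/plugin.sock user@sshbastiononkubernetes
```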
Post-Installation
Verify Deployment
```bash
# Check virtual node status
kubectl get node <nodeName>

# Check pod status
kubectl get pods -n interlink

# View virtual node details
kubectl describe node <nodeName>

# Check logs
kubectl logs -n interlink deployment/<nodeName>-node -c vk
```
Testing the Virtual Node
```yaml
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-workload
spec:
  nodeSelector:
    kubernetes.io/hostname: <nodeName>
  # Tolerate the virtual node's taint so the pod can be scheduled there
  # (adjust the key if your deployment uses a different taint)
  tolerations:
    - key: virtual-node.interlink/no-schedule
      operator: Exists
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]
```
```bash
kubectl apply -f test-pod.yaml
kubectl get pod test-workload -o wide
```
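Once the pod reports Running, the offloaded workload should also be visible on the HPC side as a regular Slurm job:

```bash
# On the HPC login node: the offloaded pod appears as a Slurm job
squeue -u $USER
```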