Digitalhub Platform: HPC Workload Offloading via InterLink
Overview
The Digitalhub platform leverages Kubernetes-native orchestration and can dynamically offload machine learning workloads to an external HPC (High Performance Computing) cluster using the InterLink Project. This hybrid approach combines the flexibility of Kubernetes with the raw computational power of HPC resources for demanding ML training and inference tasks.
Architecture
The InterLink API and the plugin can be deployed in three different arrangements across the Kubernetes cluster and the remote HPC side. See the InterLink Project documentation for more information.
In this example we use the tunneled deployment scenario, sketched below.
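A simplified view of the tunneled arrangement, using the component names, sockets, and ports configured in the steps below:

```text
Kubernetes cluster                               HPC cluster
+-----------------------------------+            +----------------------------------+
| InterLink virtual node            |            | login node                       |
|   InterLink API                   |            |   interlink-slurm-plugin         |
|   unix:///var/run/interlink.sock  |            |   unix:///var/run/plugin.sock    |
|   SSH bastion (port 31022)        |<---SSH-----|   tunnel forwards the plugin     |
+-----------------------------------+   tunnel   |   socket; jobs go to Slurm       |
                                                 +----------------------------------+
```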
Prerequisites
- Kubernetes cluster (with Digitalhub platform installed)
- Access to an HPC cluster with a job scheduler (Slurm, PBS, etc.)
- InterLink project deployed and configured
- Network connectivity between clusters
- Appropriate authentication credentials
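Before starting, you can sanity-check the network prerequisite from the HPC login node. Port 31022 is the bastion NodePort configured later in this guide; the node address placeholder is yours to fill in:

```bash
# From the HPC login node: verify the Kubernetes-side SSH bastion is reachable
nc -vz <kubernetes-node-address> 31022
```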
Kubernetes Configuration
- SSH Key Setup: Generate an SSH key pair if you don't have one:
```bash
# Generate an SSH key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/interlink_rsa

# Copy the private key to the remote HPC login node (it will open the tunnel)
scp ~/.ssh/interlink_rsa user@remote-server:~

# Test the SSH connection
ssh -i ~/.ssh/interlink_rsa user@remote-server
```
- InterLink Setup: Deploy InterLink on your Kubernetes cluster with a values file like the one below:
```bash
helm install --create-namespace -n interlink virtual-node \
  oci://ghcr.io/intertwin-eu/interlink-helm-chart/interlink \
  --values my-values.yaml
```
```yaml
# my-values.yaml
nodeName: interlink-socket-node

interlink:
  enabled: true
  socket: unix:///var/run/interlink.sock

plugin:
  socket: unix:///var/run/plugin.sock

sshBastion:
  enabled: true
  clientKeys:
    authorizedKeys: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAI..." # previously created public key (must match the key type you generated)
  port: 31022

virtualNode:
  resources:
    CPUs: 8
    memGiB: 32
    pods: 100
```
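Once the chart is installed, a couple of quick checks (resource names below follow the values above; adjust for your release):

```bash
# The virtual node should appear in the node list
kubectl get node interlink-socket-node

# The bastion service should expose port 31022
kubectl get svc -n interlink
```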
HPC Configuration
- Download the interlink-slurm-plugin on your login node.
- Configure the InterLink SLURM plugin to listen on a Unix socket instead of a TCP port:
```yaml
# /root/SlurmConfig.yaml
SidecarPort: ""
Socket: "unix:///var/run/plugin.sock"
SbatchPath: "/usr/bin/sbatch"
ScancelPath: "/usr/bin/scancel"
SqueuePath: "/usr/bin/squeue"
SinfoPath: "/usr/bin/sinfo"
CommandPrefix: ""
SingularityPrefix: ""
SingularityPath: "singularity"
ExportPodData: true
DataRootFolder: ".local/interlink/jobs/"
Namespace: "vk"
Tsocks: false
TsocksPath: "$WORK/tsocks-1.8beta5+ds1/libtsocks.so"
TsocksLoginNode: "login01"
BashPath: /bin/bash
VerboseLogging: true
ErrorsOnlyLogging: false
ContainerRuntime: singularity
EnrootDefaultOptions: ["--rw"]
EnrootPrefix: ""
EnrootPath: enroot
```
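Before starting the plugin, it helps to confirm that the scheduler binaries referenced in the config actually exist on this login node, since paths vary between sites:

```bash
# Verify the Slurm commands configured above are present and executable
for cmd in /usr/bin/sbatch /usr/bin/scancel /usr/bin/squeue /usr/bin/sinfo; do
  test -x "$cmd" && echo "OK: $cmd" || echo "MISSING: $cmd"
done

# ContainerRuntime is singularity, so it must resolve on PATH as well
singularity --version
```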
- On the remote HPC login node, start your InterLink plugin:

```bash
# Example: start the SLURM plugin on the remote HPC system
cd /path/to/plugin
SLURMCONFIGPATH=/root/SlurmConfig.yaml SHARED_FS=true /path/to/plugin/slurm-sidecar
```
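The plugin creates its Unix socket on startup. A quick check, plus one possible way (nohup; a systemd unit would also work) to keep it alive after you log out:

```bash
# The socket path from SlurmConfig.yaml should now exist
ls -l /var/run/plugin.sock

# Optionally keep the plugin running across logouts
nohup env SLURMCONFIGPATH=/root/SlurmConfig.yaml SHARED_FS=true \
  /path/to/plugin/slurm-sidecar > plugin.log 2>&1 &
```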
- Forward the SLURM plugin's Unix socket to the SSH bastion on the Kubernetes side, so the in-cluster InterLink API can reach it:

```bash
# Remote-forward (-R) the plugin socket to the bastion (NodePort 31022),
# authenticating with the key generated earlier
ssh -nNT -i ~/.ssh/interlink_rsa -p 31022 \
  -R /var/run/plugin.sock:/var/run/plugin.sock user@sshbastiononkubernetes
```
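This tunnel is the single link between the two sides, so it's worth supervising. If autossh is available on the login node (an assumption; install it if not), it can re-establish the forward automatically:

```bash
# Same remote forward, supervised by autossh with SSH keep-alives
autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
  -nNT -i ~/.ssh/interlink_rsa -p 31022 \
  -R /var/run/plugin.sock:/var/run/plugin.sock user@sshbastiononkubernetes
```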
Post-Installation
Verify Deployment
```bash
# Check virtual node status
kubectl get node <nodeName>

# Check pod status
kubectl get pods -n interlink

# View virtual node details
kubectl describe node <nodeName>

# Check logs
kubectl logs -n interlink deployment/<nodeName>-node -c vk
```
Testing the Virtual Node
```yaml
# test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-workload
spec:
  nodeSelector:
    kubernetes.io/hostname: <nodeName>
  # Tolerate the virtual node's taint so the pod can be scheduled there
  # (adjust the key if your deployment uses a different taint)
  tolerations:
    - key: virtual-node.interlink/no-schedule
      operator: Exists
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]
```
```bash
kubectl apply -f test-pod.yaml
kubectl get pod test-workload -o wide
```
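Once the pod reports Running, the offloaded workload should also be visible on the HPC side as a regular Slurm job:

```bash
# On the HPC login node: the offloaded pod appears as a Slurm job
squeue -u $USER
```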