Using AMD GPUs on Kubernetes#
SkyPilot supports AMD GPUs on Kubernetes. This guide walks through the steps to enable AMD GPU detection and scheduling in SkyPilot on Kubernetes.
Dependency installation#
Install cert-manager#
cert-manager is a dependency for the AMD GPU operator.
helm repo add jetstack https://charts.jetstack.io --force-update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.15.1 \
--set crds.enabled=true
Add AMD GPU operator Helm repo#
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update
Install AMD GPU operator#
helm install amd-gpu-operator rocm/gpu-operator-charts \
--namespace kube-amd-gpu --create-namespace \
--version=v1.3.0
GPU detection and labeling#
Check for AMD GPU labels#
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}' | grep -e "amd.com/gpu"
Check node capacity#
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity}'
Sample output:
{
"name": "<node-name>",
"capacity": {
"amd.com/gpu": "8",
"cpu": "160",
"ephemeral-storage": "2077109340Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "1981633424Ki",
"pods": "110"
}
}
Label nodes for SkyPilot#
kubectl label node <node-name> skypilot.co/accelerator=mi300
Verify labels#
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}' | grep -e "skypilot.co/accelerator"
Launch a cluster with SkyPilot#
Check Kubernetes cluster is enabled for SkyPilot#
sky check kubernetes
Sample output:
🎉 Enabled infra 🎉
Kubernetes [compute]
Allowed contexts:
└── <your context name>
List available accelerators (AMD GPUs)#
sky show-gpus --infra kubernetes
Sample output:
Kubernetes GPUs
Context: <your context name>
GPU REQUESTABLE_QTY_PER_NODE UTILIZATION
MI300 1, 2, 4, 8 6 of 8 free
Kubernetes per-node GPU availability
CONTEXT NODE GPU UTILIZATION
<your context name> mi300-8gpus MI300 6 of 8 free
Run a sample example with AMD Docker images#
Smoke test:
sky launch -c amd-cluster examples/amd/amd_smoke_test.yaml
Pytorch example Reinforcement learning:
sky launch -c amd-cluster examples/amd/amd_pytorch_RL.yaml
More example task YAMLs are available under examples/amd directory.