Deployment Guides#
Below we include minimal guides to set up a new Kubernetes cluster in different environments, including hosted services on the cloud.
Deploying locally on your laptop#
To try out SkyPilot on Kubernetes on your laptop or run SkyPilot tasks locally without requiring any cloud access, we provide the sky local up CLI to create a 1-node Kubernetes cluster locally.
Under the hood, sky local up uses kind, a tool for creating a Kubernetes cluster on your local machine. It runs a Kubernetes cluster inside a container, so no setup is required.
Run sky local up to launch a Kubernetes cluster and automatically configure your kubeconfig file:
$ sky local up
Run sky check and verify that Kubernetes is enabled in SkyPilot. You can now run SkyPilot tasks on this locally hosted Kubernetes cluster using sky launch.
After you are done using the cluster, you can remove it with sky local down. This will destroy the local Kubernetes cluster and switch your kubeconfig back to its original context:
$ sky local down
Note
We recommend allocating at least 4 CPUs to your Docker runtime to ensure kind has enough resources. See instructions to increase CPU allocation here.
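As a quick sanity check before running sky local up, the sketch below reads the CPU count that Docker reports and compares it with the recommended minimum of 4. It assumes the docker CLI is on your PATH and relies on docker info --format '{{.NCPU}}'. The helper names are illustrative and are not part of SkyPilot.

```python
import subprocess

# Minimum taken from the recommendation in the note above.
RECOMMENDED_MIN_CPUS = 4


def parse_ncpu(docker_info_output: str) -> int:
    """Parse the output of `docker info --format '{{.NCPU}}'`."""
    return int(docker_info_output.strip())


def has_enough_cpus(ncpu: int, minimum: int = RECOMMENDED_MIN_CPUS) -> bool:
    """Return True if the Docker runtime has at least `minimum` CPUs."""
    return ncpu >= minimum


def docker_cpu_count() -> int:
    """Ask the local Docker daemon how many CPUs it can use."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.NCPU}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_ncpu(result.stdout)
```

Calling has_enough_cpus(docker_cpu_count()) on a machine with Docker running tells you whether the allocation needs to be raised before creating the kind cluster.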
Note
kind does not support multiple nodes and GPUs. It is not recommended for use in a production environment. If you want to run a private on-prem cluster, see the section on on-prem deployment for more.
Deploying on Google Cloud GKE#
Create a GKE standard cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
Example: create a GKE cluster with 2 nodes, each having 16 CPUs.
PROJECT_ID=$(gcloud config get-value project)
CLUSTER_NAME=testcluster
gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" \
  --zone "us-central1-c" \
  --no-enable-basic-auth \
  --cluster-version "1.29.4-gke.1043002" \
  --release-channel "regular" \
  --machine-type "e2-standard-16" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-balanced" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "2" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM \
  --enable-ip-alias \
  --network "projects/${PROJECT_ID}/global/networks/default" \
  --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --security-posture=standard \
  --workload-vulnerability-scanning=disabled \
  --no-enable-master-authorized-networks \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --enable-managed-prometheus \
  --enable-shielded-nodes \
  --node-locations "us-central1-c"
Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the GKE cluster:
$ gcloud container clusters get-credentials <cluster-name> --region <region>
# Example:
# gcloud container clusters get-credentials testcluster --region us-central1-c
[If using GPUs] For GKE versions newer than 1.30.1-gke.115600, NVIDIA drivers are pre-installed and no additional setup is required. If you are using an older GKE version, you may need to manually install NVIDIA drivers for GPU support. You can do so by deploying the daemonset depending on the GPU and OS on your nodes:
# For Container Optimized OS (COS) based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# For Container Optimized OS (COS) based nodes with L4 GPUs:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

# For Ubuntu based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

# For Ubuntu based nodes with L4 GPUs:
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml
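To decide whether the manual driver installation applies to a given cluster, the version threshold quoted above can be turned into a small comparison. This is a minimal sketch: it assumes GKE version strings follow the MAJOR.MINOR.PATCH-gke.N pattern and that comparing the four numbers in order matches what "newer" means here. The function names are hypothetical.

```python
import re

# Threshold quoted in the text above.
DRIVERS_PREINSTALLED_AFTER = "1.30.1-gke.115600"


def parse_gke_version(version: str) -> tuple:
    """Turn '1.29.4-gke.1043002' into (1, 29, 4, 1043002)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def may_need_manual_driver_install(version: str) -> bool:
    """True if `version` is not newer than the threshold (assumed ordering)."""
    threshold = parse_gke_version(DRIVERS_PREINSTALLED_AFTER)
    return parse_gke_version(version) <= threshold
```

For the example cluster created earlier with version 1.29.4-gke.1043002, may_need_manual_driver_install returns True, so one of the daemonsets above would be relevant if the nodes have GPUs.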
Tip
To verify that GPU drivers are set up, run kubectl describe nodes and check that the nvidia.com/gpu resource is listed under the Capacity section.
Verify your Kubernetes cluster is correctly set up for SkyPilot by running sky check:
$ sky check
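The driver check from the tip above can also be scripted instead of reading kubectl describe nodes by eye. The sketch below assumes kubectl is configured for the cluster and reads the node capacity from kubectl get nodes -o json. The helper names are illustrative only.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_capacity(node_list: dict) -> dict:
    """Map each node name to the GPU count listed in its capacity."""
    counts = {}
    for node in node_list.get("items", []):
        capacity = node.get("status", {}).get("capacity", {})
        # Kubernetes reports quantities as strings, e.g. "4".
        counts[node["metadata"]["name"]] = int(capacity.get(GPU_RESOURCE, "0"))
    return counts


def load_nodes() -> dict:
    """Fetch all nodes via kubectl (needs a configured kubeconfig)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

A GPU node that shows a count of 0 in gpu_capacity(load_nodes()) suggests the drivers or device plugin are not set up on that node yet.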
[If using GPUs] Check available GPUs in the Kubernetes cluster with sky show-gpus --cloud k8s:
$ sky show-gpus --cloud k8s
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
L4    1, 2, 4                   8           6
A100  1, 2                      4           2

Kubernetes per node GPU availability
NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
my-cluster-0  L4        4           4
my-cluster-1  L4        4           2
my-cluster-2  A100      2           2
my-cluster-3  A100      2           0
Note
GKE autopilot clusters are currently not supported. Only GKE standard clusters are supported.
Deploying on Amazon EKS#
Create an EKS cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.
Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the EKS cluster:
$ aws eks update-kubeconfig --name <cluster-name> --region <region>
# Example:
# aws eks update-kubeconfig --name testcluster --region us-west-2
[If using GPUs] EKS clusters already come with Nvidia drivers set up. However, you will need to label the nodes with the GPU type. Use the SkyPilot node labelling tool to do so:
python -m sky.utils.kubernetes.gpu_labeler
This will create a job on each node to read the GPU type from nvidia-smi and assign a skypilot.co/accelerator label to the node. You can check the status of these jobs by running:
kubectl get jobs -n kube-system
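Once the labeling jobs have finished, one way to confirm the result is to look for GPU nodes that still lack the skypilot.co/accelerator label. The sketch below works on the parsed output of kubectl get nodes -o json. The function name is hypothetical, and treating a node as a GPU node when it reports nvidia.com/gpu capacity is an assumption of this example.

```python
ACCELERATOR_LABEL = "skypilot.co/accelerator"
GPU_RESOURCE = "nvidia.com/gpu"


def unlabeled_gpu_nodes(node_list: dict) -> list:
    """Return names of GPU nodes that are missing the accelerator label."""
    missing = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings, e.g. "2".
        has_gpu = int(capacity.get(GPU_RESOURCE, "0")) > 0
        labels = metadata.get("labels") or {}
        if has_gpu and ACCELERATOR_LABEL not in labels:
            missing.append(metadata.get("name", "<unknown>"))
    return missing
```

Pass it json.loads of the kubectl output. An empty list means every GPU node carries the label.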
Verify your Kubernetes cluster is correctly set up for SkyPilot by running sky check:
$ sky check
[If using GPUs] Check available GPUs in the Kubernetes cluster with sky show-gpus --cloud k8s:
$ sky show-gpus --cloud k8s
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
A100  1, 2                      4           2

Kubernetes per node GPU availability
NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
my-cluster-0  A100      2           2
Deploying on on-prem clusters#
If you have a list of IP addresses and the SSH credentials for your on-prem cluster, you can follow our Using Existing Machines guide to set up SkyPilot on your on-prem cluster.
Alternatively, you can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools, such as kubeadm, k3s or Rancher. Please follow their respective guides to deploy your Kubernetes cluster.
Notes for specific Kubernetes distributions#
Some Kubernetes distributions require additional steps to set up GPU support.
Rancher Kubernetes Engine 2 (RKE2)#
Installing the Nvidia GPU operator on RKE2 through helm requires extra flags to set nvidia as the default runtime for containerd:
$ helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator $HELM_OPTIONS \
--set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
--set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
--set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
--set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
--set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
--set 'toolkit.env[2].value=nvidia' \
--set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
--set-string 'toolkit.env[3].value=true'
Refer to instructions on Nvidia GPU Operator installation with Helm on RKE2 for details.
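The indexed toolkit.env[i] flags in the command above are easy to mistype, so the sketch below generates them from a mapping. It is an illustration only: the function name is hypothetical, the values are copied from the RKE2 command above, and it passes every value with --set-string (the original command does this only for the last entry) so that a value like true stays a string.

```python
# Environment variables copied from the RKE2 helm command above.
RKE2_TOOLKIT_ENV = {
    "CONTAINERD_CONFIG": "/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl",
    "CONTAINERD_SOCKET": "/run/k3s/containerd/containerd.sock",
    "CONTAINERD_RUNTIME_CLASS": "nvidia",
    "CONTAINERD_SET_AS_DEFAULT": "true",
}


def toolkit_env_flags(env: dict) -> list:
    """Build the indexed toolkit.env flags for `helm install`."""
    flags = []
    # Dicts keep insertion order, so the indices follow the mapping order.
    for index, (name, value) in enumerate(env.items()):
        flags += ["--set", f"toolkit.env[{index}].name={name}"]
        # --set-string prevents helm from parsing "true" as a boolean.
        flags += ["--set-string", f"toolkit.env[{index}].value={value}"]
    return flags
```

The resulting list can be appended to the helm install arguments, for example when calling helm through subprocess.run.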
K3s#
Installing the Nvidia GPU operator on K3s is similar to the RKE2 instructions from Nvidia, but requires changing the CONTAINERD_CONFIG variable to /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl. Here is an example command to install the Nvidia GPU operator on K3s:
$ helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator $HELM_OPTIONS \
--set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
--set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
--set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
--set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
--set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
--set 'toolkit.env[2].value=nvidia'
Check the status of the GPU operator installation by running kubectl get pods -n gpu-operator. It takes a few minutes to install, and some CrashLoopBackOff errors are expected during the installation process.
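Because some CrashLoopBackOff errors are expected at first, it can help to list exactly which pods are in that state and watch whether the list shrinks over time. The sketch below inspects the parsed output of kubectl get pods -n gpu-operator -o json. The function name is illustrative, and it checks both init containers and regular containers.

```python
def crashlooping_pods(pod_list: dict) -> list:
    """Return names of pods with a container waiting in CrashLoopBackOff."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        for container in statuses:
            # A waiting container carries the reason under state.waiting.
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                stuck.append(pod["metadata"]["name"])
                break  # Report each pod at most once.
    return stuck
```

If the same pod names keep appearing after several minutes, the installation is likely stuck rather than still starting up.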
Tip
If your gpu-operator installation stays stuck in CrashLoopBackOff, you may need to create a symlink to the ldconfig binary to work around a known issue with the nvidia-docker runtime. Run the following command on your nodes:
$ ln -s /sbin/ldconfig /sbin/ldconfig.real
After the GPU operator is installed, create the nvidia RuntimeClass required by K3s. This runtime class will automatically be used by SkyPilot to schedule GPU pods:
$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
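After applying the manifest, you may want to confirm that the RuntimeClass exists with the expected handler. The sketch below checks the parsed output of kubectl get runtimeclass -o json. The function name and its defaults are assumptions made for this example.

```python
def has_runtime_class(runtime_classes: dict,
                      name: str = "nvidia",
                      handler: str = "nvidia") -> bool:
    """Return True if a RuntimeClass with this name and handler exists."""
    for item in runtime_classes.get("items", []):
        item_name = item.get("metadata", {}).get("name")
        # `handler` is a top-level field of a RuntimeClass object.
        if item_name == name and item.get("handler") == handler:
            return True
    return False
```

Pass it json.loads of the kubectl output. A False result means the RuntimeClass is missing or points at a different handler.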
Deploying on cloud VMs#
You can also spin up on-demand cloud VMs and deploy Kubernetes on them.
We provide scripts to take care of provisioning VMs, installing Kubernetes, setting up GPU support and configuring your local kubeconfig. Refer to our Deploying Kubernetes on VMs guide for more details.