Deployment Guides#

Below are minimal guides to set up a new Kubernetes cluster in different environments, including hosted services on the cloud.

Local Development Cluster: run a local Kubernetes cluster on your laptop with sky local up.

On-prem Clusters (RKE2, K3s, etc.): for on-prem deployments with kubeadm, RKE2, K3s or other distributions.

Google Cloud - GKE: Google's hosted Kubernetes service.

Amazon - EKS: Amazon's hosted Kubernetes service.

Deploying locally on your laptop#

To try out SkyPilot on Kubernetes on your laptop, or to run SkyPilot tasks locally without any cloud access, we provide the sky local up CLI to create a single-node Kubernetes cluster on your machine.

Under the hood, sky local up uses kind, a tool for creating a Kubernetes cluster on your local machine. kind runs the Kubernetes cluster inside a Docker container, so no setup beyond Docker is required.
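
For reference, this is roughly what sky local up does with kind (a sketch; the cluster name is illustrative, and sky local up additionally applies SkyPilot-specific configuration and kubeconfig switching):

$ kind create cluster --name skypilot
$ kubectl cluster-info --context kind-skypilot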

  1. Install Docker and kind.

  2. Run sky local up to launch a Kubernetes cluster and automatically configure your kubeconfig file:

    $ sky local up
    
  3. Run sky check and verify that Kubernetes is enabled in SkyPilot. You can now run SkyPilot tasks on this locally hosted Kubernetes cluster using sky launch.

  4. After you are done using the cluster, you can remove it with sky local down. This will destroy the local Kubernetes cluster and switch your kubeconfig back to its original context:

    $ sky local down
    

Note

We recommend allocating at least 4 CPUs to your Docker runtime to ensure kind has enough resources. See instructions to increase CPU allocation here.

Note

kind does not support multiple nodes or GPUs, and it is not recommended for use in a production environment. If you want to run a private on-prem cluster, see the section on on-prem deployment for more.
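
Once sky check shows Kubernetes as enabled, you can smoke-test the local cluster by launching a trivial task. A minimal sketch, assuming default resources are acceptable:

$ sky launch -y --cloud kubernetes -- echo "hello from kind"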

Deploying on Google Cloud GKE#

  1. Create a GKE standard cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.

    Example: create a GKE cluster with 2 nodes, each having 16 CPUs.
    PROJECT_ID=$(gcloud config get-value project)
    CLUSTER_NAME=testcluster
    gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" \
      --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.29.4-gke.1043002" \
      --release-channel "regular" --machine-type "e2-standard-16" --image-type "COS_CONTAINERD" \
      --disk-type "pd-balanced" --disk-size "100" --metadata disable-legacy-endpoints=true \
      --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
      --num-nodes "2" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias \
      --network "projects/${PROJECT_ID}/global/networks/default" \
      --subnetwork "projects/${PROJECT_ID}/regions/us-central1/subnetworks/default" \
      --no-enable-intra-node-visibility --default-max-pods-per-node "110" \
      --security-posture=standard --workload-vulnerability-scanning=disabled \
      --no-enable-master-authorized-networks \
      --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
      --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
      --enable-managed-prometheus --enable-shielded-nodes --node-locations "us-central1-c"
    
  2. Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the GKE cluster:

    $ gcloud container clusters get-credentials <cluster-name> --region <region>
    
    # Example:
    # gcloud container clusters get-credentials testcluster --region us-central1-c
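
    To confirm that kubectl now points at the new cluster, you can check the active context:

    $ kubectl config current-context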
    
  3. [If using GPUs] If your GKE nodes have GPUs, you may need to manually install Nvidia drivers. You can do so by deploying a daemonset, depending on the GPU and OS of your nodes:

    # For Container Optimized OS (COS) based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
    $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    
    # For Container Optimized OS (COS) based nodes with L4 GPUs:
    $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
    
    # For Ubuntu based nodes with GPUs other than Nvidia L4 (e.g., V100, A100, ...):
    $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
    
    # For Ubuntu based nodes with L4 GPUs:
    $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R525.yaml
    

    To verify that GPU drivers are set up, run kubectl describe nodes and check that nvidia.com/gpu is listed under the Capacity section.
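
    Alternatively, you can list the GPU count reported by each node in one command. A convenience sketch (the custom-columns expression escapes the dots in the resource name):

    $ kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"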

  4. Verify your Kubernetes cluster is correctly set up for SkyPilot by running sky check:

    $ sky check
    
  5. [If using GPUs] Check the available GPUs in the Kubernetes cluster with sky show-gpus --cloud kubernetes:

    $ sky show-gpus --cloud kubernetes
    GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
    L4    1, 2, 3, 4    8           6
    A100  1, 2          4           2
    

Note

GKE autopilot clusters are currently not supported. Only GKE standard clusters are supported.
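
If your cluster does not yet have GPU nodes, you can add a GPU node pool to an existing standard cluster. A sketch, assuming a single T4 node in the same zone (the machine type, accelerator type and count, and zone are examples; adjust to your needs and quota):

$ gcloud container node-pools create gpu-pool \
    --cluster "${CLUSTER_NAME}" --zone "us-central1-c" \
    --machine-type "n1-standard-8" \
    --accelerator "type=nvidia-tesla-t4,count=1" \
    --num-nodes "1"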

Deploying on Amazon EKS#

  1. Create an EKS cluster with at least 1 node. We recommend creating nodes with at least 4 vCPUs.

  2. Get the kubeconfig for your cluster. The following command will automatically update ~/.kube/config with a new kubecontext for the EKS cluster:

    $ aws eks update-kubeconfig --name <cluster-name> --region <region>
    
    # Example:
    # aws eks update-kubeconfig --name testcluster --region us-west-2
    
  3. [If using GPUs] EKS clusters already come with Nvidia drivers set up. However, you will need to label the nodes with the GPU type. Use the SkyPilot node labeling tool to do so:

    $ python -m sky.utils.kubernetes.gpu_labeler
    

    This will create a job on each node to read the GPU type from nvidia-smi and assign a skypilot.co/accelerator label to the node. You can check the status of these jobs by running:

    $ kubectl get jobs -n kube-system
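
    Once the jobs complete, you can confirm the labels were applied (the -L flag prints the label value as a column):

    $ kubectl get nodes -L skypilot.co/accelerator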
    
  4. Verify your Kubernetes cluster is correctly set up for SkyPilot by running sky check:

    $ sky check
    
  5. [If using GPUs] Check the available GPUs in the Kubernetes cluster with sky show-gpus --cloud kubernetes:

    $ sky show-gpus --cloud kubernetes
    GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
    A100  1, 2          4           2
    

Deploying on on-prem clusters#

You can also deploy Kubernetes on your on-prem clusters using off-the-shelf tools, such as kubeadm, K3s or Rancher. Please follow their respective guides to deploy your Kubernetes cluster.
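
For example, a single-node K3s cluster can be stood up with its official install script (a minimal sketch; see the K3s documentation for multi-node and production configurations):

$ curl -sfL https://get.k3s.io | sh -
$ sudo k3s kubectl get nodes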

Notes for specific Kubernetes distributions#

Some Kubernetes distributions require additional steps to set up GPU support.

Rancher Kubernetes Engine 2 (RKE2)#

Nvidia GPU operator installation on RKE2 through Helm requires extra flags to set nvidia as the default runtime for containerd.
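
The command below assumes the Nvidia Helm repository has already been added, and uses $HELM_OPTIONS as a placeholder for any extra flags you pass to helm install (e.g., --version). If you have not added the repository yet:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update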

$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
    --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[2].value=nvidia' \
    --set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
    --set-string 'toolkit.env[3].value=true'

Refer to instructions on Nvidia GPU Operator installation with Helm on RKE2 for details.

K3s#

Installing the Nvidia GPU operator on K3s is similar to the RKE2 instructions from Nvidia, but requires changing the CONTAINERD_CONFIG variable to /var/lib/rancher/k3s/agent/etc/containerd/config.toml. Here is an example command to install the Nvidia GPU operator on K3s:

$ helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
    --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[2].value=nvidia'

Check the status of the GPU operator installation by running kubectl get pods -n gpu-operator. Installation takes a few minutes, and some CrashLoopBackOff errors are expected during the process.
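
For example, to watch the pods until they settle (all pods should eventually reach the Running or Completed state):

$ watch kubectl get pods -n gpu-operator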

Tip

If your gpu-operator installation stays stuck in CrashLoopBackOff, you may need to create a symlink to the ldconfig binary to work around a known issue with nvidia-docker runtime. Run the following command on your nodes:

$ ln -s /sbin/ldconfig /sbin/ldconfig.real

After the GPU operator is installed, create the nvidia RuntimeClass required by K3s. This runtime class will automatically be used by SkyPilot to schedule GPU pods:

$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
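
After the runtime class is in place, you can verify end-to-end GPU access with a short-lived test pod. A sketch, assuming at least one free GPU is available; the CUDA image tag is an example:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    # Print the driver and GPU info, then exit.
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs gpu-test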