Kubernetes Cluster Setup#

Note

This is a guide for cluster administrators on how to set up Kubernetes clusters for use with SkyPilot.

If you are a SkyPilot user and your cluster administrator has already set up a cluster and shared a kubeconfig file with you, Submitting tasks to Kubernetes explains how to submit tasks to your cluster.

⚙️ Setup Kubernetes Cluster

Configure your Kubernetes cluster to run SkyPilot.

✅️ Verify Setup

Ensure your cluster is set up correctly for SkyPilot.

👀️ Observability

Use your existing Kubernetes tooling to monitor SkyPilot resources.

Setting up Kubernetes cluster for SkyPilot#

To prepare a Kubernetes cluster to run SkyPilot, the cluster administrator must:

  1. Deploy a cluster running Kubernetes v1.20 or later.

  2. Set up GPU support.

  3. [Optional] Set up ports for exposing services.

  4. [Optional] Set up permissions: create a namespace for your users and/or create a service account with minimal permissions for SkyPilot.

After these steps, the administrator can share the kubeconfig file with users, who can then submit tasks to the cluster using SkyPilot.
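
As a quick check of requirement 1, the sketch below compares a cluster's reported version against v1.20. It assumes the JSON layout printed by kubectl version -o json (a serverVersion object with major and minor strings). The helper names meets_minimum and fetch_server_version are illustrative and not part of SkyPilot.

```python
import json
import re
import subprocess

# Minimum version stated in this guide.
MIN_VERSION = (1, 20)


def meets_minimum(server_version: dict) -> bool:
    """Return True if a `serverVersion` dict reports v1.20 or later."""
    # Some managed services report a minor version such as "27+",
    # so keep only the leading digits before comparing.
    major = int(re.match(r"\d+", server_version["major"]).group())
    minor = int(re.match(r"\d+", server_version["minor"]).group())
    return (major, minor) >= MIN_VERSION


def fetch_server_version() -> dict:
    """Query the cluster (requires kubectl on PATH and cluster access)."""
    out = subprocess.run(
        ["kubectl", "version", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["serverVersion"]
```

With access to the cluster, meets_minimum(fetch_server_version()) should return True before you continue with the remaining steps.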

Step 1 - Deploy a Kubernetes Cluster#

Tip

If you already have a Kubernetes cluster, skip this step.

Below we link to minimal guides to set up a new Kubernetes cluster in different environments, including hosted services on the cloud.

Local Development Cluster

Run a local Kubernetes cluster on your laptop with sky local up.

On-prem Clusters (RKE2, K3s, etc.)

For on-prem deployments with kubeadm, RKE2, K3s or other distributions.

Google Cloud - GKE

Google’s hosted Kubernetes service.

Amazon - EKS

Amazon’s hosted Kubernetes service.

Step 2 - Set up GPU support#

To utilize GPUs on Kubernetes, your cluster must:

  1. Have the nvidia.com/gpu resource available on all GPU nodes and have nvidia as the default runtime for your container engine.

  2. Have a label on each node specifying the GPU type. See Setting up GPU labels for more details.

Tip

After step 1, to verify that the Nvidia GPU Operator is installed and that the nvidia runtime is set as the default, run:

$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
$ watch kubectl get pods
# If the pod status changes to Completed after a few minutes, the Nvidia GPU driver is set up correctly. Move on to setting up GPU labels.
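
To complement the test pod, the following sketch reports how many nvidia.com/gpu resources each node advertises. It assumes the standard node JSON returned by kubectl get nodes -o json, where allocatable resources appear under status.allocatable. The function names are hypothetical helpers, not SkyPilot APIs.

```python
import json
import subprocess
from typing import Dict


def gpus_per_node(node_list: dict) -> Dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count (0 if absent)."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities are serialized as strings, e.g. "8".
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


def fetch_node_list() -> dict:
    """Fetch all nodes (requires kubectl on PATH and cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

A GPU node that shows a count of 0 likely does not yet expose the nvidia.com/gpu resource. CPU-only nodes are expected to show 0.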

Note

Refer to Notes for specific Kubernetes distributions for additional instructions on setting up GPU support on specific Kubernetes distributions, such as RKE2 and K3s.

Setting up GPU labels#

Tip

If your cluster has the Nvidia GPU Operator installed or you are using GKE or Karpenter, your cluster already has the necessary GPU labels. You can skip this section.

To use GPUs with SkyPilot, cluster nodes must be labelled with the GPU type. This informs SkyPilot which GPU types are available on the cluster.

Currently supported labels are:

  • nvidia.com/gpu.product: automatically created by Nvidia GPU Operator.

  • cloud.google.com/gke-accelerator: used by GKE clusters.

  • karpenter.k8s.aws/instance-gpu-name: used by Karpenter.

  • skypilot.co/accelerator: custom label used by SkyPilot if none of the above are present.

Any one of these labels is sufficient for SkyPilot to detect GPUs on the cluster.
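
The idea that any one label is enough can be sketched as a lookup over a node's labels. This is only an illustration: the order in which the keys are checked is an assumption made here, and find_gpu_label is a hypothetical helper rather than SkyPilot's actual detection logic.

```python
from typing import Dict, Optional, Tuple

# Label keys listed in this guide. The lookup order below is an
# assumption for illustration only.
SUPPORTED_GPU_LABELS = (
    "nvidia.com/gpu.product",
    "cloud.google.com/gke-accelerator",
    "karpenter.k8s.aws/instance-gpu-name",
    "skypilot.co/accelerator",
)


def find_gpu_label(node_labels: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (label_key, gpu_value) for the first supported label, else None."""
    for key in SUPPORTED_GPU_LABELS:
        value = node_labels.get(key)
        if value:
            return key, value
    return None
```

For example, a node labelled only with skypilot.co/accelerator=h100 yields ("skypilot.co/accelerator", "h100"), while a node with none of the four labels yields None.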

Tip

To check if your nodes contain the necessary labels, run:

output=$(kubectl get nodes --show-labels | awk -F'[, ]' '{for (i=1; i<=NF; i++) if ($i ~ /nvidia.com\/gpu.product=|cloud.google.com\/gke-accelerator=|karpenter.k8s.aws\/instance-gpu-name=|skypilot.co\/accelerator=/) print $i}')
if [ -z "$output" ]; then
  echo "No valid GPU labels found."
else
  echo "GPU Labels found:"
  echo "$output"
fi

Automatically Labelling Nodes#

If none of the above labels are present on your cluster, we provide a convenience script that automatically detects GPU types and labels each node with the skypilot.co/accelerator label. You can run it with:

$ python -m sky.utils.kubernetes.gpu_labeler

Created GPU labeler job for node ip-192-168-54-76.us-west-2.compute.internal
Created GPU labeler job for node ip-192-168-93-215.us-west-2.compute.internal
GPU labeling started - this may take 10 min or more to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs --namespace=kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`.

Note

If the GPU labelling process fails, you can run python -m sky.utils.kubernetes.gpu_labeler --cleanup to clean up the failed jobs.

Manually Labelling Nodes#

You can also manually label nodes, if required. Labels must be of the format skypilot.co/accelerator: <gpu_name> where <gpu_name> is the lowercase name of the GPU.

For example, a node with H100 GPUs must have a label skypilot.co/accelerator: h100.

Use the following command to label a node:

kubectl label nodes <node-name> skypilot.co/accelerator=<gpu_name>

Note

GPU labels are case-sensitive. Ensure that the GPU name is lowercase if you are using the skypilot.co/accelerator label.
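
Because the label value must be lowercase, it can help to normalize the GPU name before labelling many nodes. The sketch below only builds the kubectl label command shown above and does not run it. build_label_command is a hypothetical helper name and my-node is a placeholder node name.

```python
import shlex
from typing import List


def build_label_command(node_name: str, gpu_name: str) -> List[str]:
    """Build the kubectl command that applies the skypilot.co/accelerator label."""
    # The label value must be the lowercase GPU name, e.g. "h100".
    label = f"skypilot.co/accelerator={gpu_name.strip().lower()}"
    return ["kubectl", "label", "nodes", node_name, label]


# Show the command for review before running it.
print(shlex.join(build_label_command("my-node", "H100")))
# prints: kubectl label nodes my-node skypilot.co/accelerator=h100
```

The returned list can be passed to subprocess.run once you have confirmed the node and GPU names.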

[Optional] Step 3 - Set up for Exposing Services#

Tip

If you are using GKE or EKS, or do not plan to expose ports publicly on Kubernetes (such as with sky launch --ports or SkyServe), no additional setup is required. On GKE and EKS, SkyPilot will create a LoadBalancer service automatically.

Running SkyServe or tasks that expose ports requires additional setup. SkyPilot supports two modes for exposing ports; refer to Exposing Services on Kubernetes for details on each.

[Optional] Step 4 - Namespace and Service Account Setup#

Tip

This step is needed only in specific environments. By default, SkyPilot runs in the namespace configured in the current kube-context and creates a service account named skypilot-service-account to run tasks. If these defaults work for you, skip this step.

If your cluster requires isolating SkyPilot tasks to a specific namespace and restricting the permissions granted to users, you can create a new namespace and service account for SkyPilot to use.

The minimal permissions required for the service account can be found on the Minimal Kubernetes Permissions page.

To simplify the setup, we provide a script that creates a namespace and service account with the necessary permissions for a given service account name and namespace.

# Download the script
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/sky/utils/kubernetes/generate_kubeconfig.sh
chmod +x generate_kubeconfig.sh

# Execute the script to generate a kubeconfig file with the service account and namespace
# Replace my-sa and my-namespace with your desired service account name and namespace
# The script will create the namespace if it does not exist and create a service account with the necessary permissions.
SKYPILOT_SA_NAME=my-sa SKYPILOT_NAMESPACE=my-namespace ./generate_kubeconfig.sh

You may distribute the generated kubeconfig file to users who can then use it to submit tasks to the cluster.
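
Before distributing a kubeconfig, you may want to confirm which namespace its current context points to. The sketch below assumes the JSON layout printed by kubectl config view --minify -o json and falls back to default when no namespace is set, which mirrors kubectl's own behavior. The helper names are illustrative, and the kubeconfig path is something you supply.

```python
import json
import subprocess


def current_namespace(kubeconfig: dict) -> str:
    """Return the namespace of the kubeconfig's current context."""
    current = kubeconfig.get("current-context")
    for entry in kubeconfig.get("contexts") or []:
        if entry.get("name") == current:
            # kubectl uses "default" when the context sets no namespace.
            return entry.get("context", {}).get("namespace") or "default"
    raise ValueError(f"current-context {current!r} not found in kubeconfig")


def load_kubeconfig(path: str) -> dict:
    """Load a kubeconfig file as JSON (requires kubectl on PATH)."""
    out = subprocess.run(
        ["kubectl", "config", "view", "--minify", "-o", "json",
         "--kubeconfig", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

For a kubeconfig generated with SKYPILOT_NAMESPACE=my-namespace, current_namespace(load_kubeconfig(path)) would be expected to return my-namespace.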

Verifying Setup#

Once the cluster is deployed and you have placed your kubeconfig at ~/.kube/config, verify your setup by running sky check:

sky check kubernetes

This should show Kubernetes: Enabled without any warnings.

You can also check the GPUs available on your nodes by running:

$ sky show-gpus --cloud k8s
Kubernetes GPUs
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
L4    1, 2, 4                   12          12
H100  1, 2, 4, 8                16          16

Kubernetes per node GPU availability
NODE_NAME                  GPU_NAME  TOTAL_GPUS  FREE_GPUS
my-cluster-0               L4        4           4
my-cluster-1               L4        4           4
my-cluster-2               L4        2           2
my-cluster-3               L4        2           2
my-cluster-4               H100      8           8
my-cluster-5               H100      8           8

Observability for Administrators#

All SkyPilot tasks run in pods inside a Kubernetes cluster. As a cluster administrator, you can inspect running pods (e.g., with kubectl get pods -n <namespace>) to check which tasks are running and how many resources they are consuming on the cluster.
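
As one way to see GPU consumption per pod, the sketch below sums nvidia.com/gpu requests from the output of kubectl get pods -n <namespace> -o json. It counts every pod in the list, not only SkyPilot pods, and gpu_requests is a hypothetical helper name.

```python
from typing import Dict


def gpu_requests(pod_list: dict) -> Dict[str, int]:
    """Map "namespace/pod" -> total nvidia.com/gpu requested by its containers."""
    totals = {}
    for pod in pod_list.get("items", []):
        key = f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        total = 0
        for container in pod["spec"].get("containers", []):
            resources = container.get("resources", {})
            # Prefer requests; fall back to limits if requests are absent.
            quantity = (resources.get("requests", {}).get("nvidia.com/gpu")
                        or resources.get("limits", {}).get("nvidia.com/gpu")
                        or "0")
            total += int(quantity)
        if total:
            totals[key] = total
    return totals
```

Pods that request no GPUs are omitted from the result, so an empty dictionary means no GPUs are currently requested in the inspected namespace.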

Below, we provide tips on how to monitor SkyPilot resources on your Kubernetes cluster.

List SkyPilot resources across all users#

We provide a convenience command, sky status --k8s, to view the status of all SkyPilot resources in the cluster.

Unlike sky status, which lists only the SkyPilot resources launched by the current user, sky status --k8s lists all SkyPilot resources in the cluster across all users.

$ sky status --k8s
Kubernetes cluster state (context: mycluster)
SkyPilot clusters
USER     NAME                           LAUNCHED    RESOURCES                                  STATUS
alice    infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
alice    sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
alice    sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
bob      dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
bob      multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
bob      sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP

Managed jobs
In progress tasks: 1 STARTING
USER     ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
alice    1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
bob      4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
bob      3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
bob      2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
bob      1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED

Kubernetes Dashboard#

You can deploy tools such as the Kubernetes dashboard to easily view and manage SkyPilot resources on your cluster.

As a demo, we provide a sample Kubernetes dashboard deployment manifest that you can deploy with:

$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/scripts/dashboard.yaml

To access the dashboard, run:

$ kubectl proxy

In a browser, open http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ and click on Skip when prompted for credentials.

Note that this dashboard can only be accessed from the machine where the kubectl proxy command is executed.

Note

The demo dashboard is not secure and should not be used in production. Please refer to the Kubernetes documentation for more information on how to set up access control for the dashboard.

Troubleshooting Kubernetes Setup#

If you encounter issues while setting up your Kubernetes cluster, please refer to the troubleshooting guide to diagnose and fix issues.