Multiple Kubernetes Clusters#

SkyPilot allows you to manage dev pods, jobs and services across multiple Kubernetes clusters through a single pane of glass.

You may have multiple Kubernetes clusters for different:

  • Use cases: e.g., a production cluster and a development/testing cluster.

  • Regions or clouds: e.g., US and EU regions; or AWS and Lambda clouds.

  • Accelerators: e.g., NVIDIA H100 cluster and a Google TPU cluster.

  • Configurations: e.g., a small cluster for a single node and a large cluster for multiple nodes.

  • Kubernetes versions: e.g., to upgrade a cluster from Kubernetes 1.20 to 1.21, you may create a new Kubernetes cluster to avoid downtime or unexpected errors.

[Figure: multi-kubernetes.svg]

Configuration#

Step 1: Set up credentials#

To work with multiple Kubernetes clusters, their credentials must be set up as individual contexts in your local ~/.kube/config file.

For deploying new clusters and getting credentials, see Deployment Guides.

For example, a ~/.kube/config file may look like this:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    ...
    server: https://xx.xx.xx.xx:45819
  name: my-h100-cluster
- cluster:
    certificate-authority-data:
    ...
    server: https://yy.yy.yy.yy:45819
  name: my-tpu-cluster
contexts:
- context:
    cluster: my-h100-cluster
    user: my-h100-cluster
  name: my-h100-cluster
- context:
    cluster: my-tpu-cluster
    namespace: my-namespace
    user: my-tpu-cluster
  name: my-tpu-cluster
current-context: my-h100-cluster
...

This example defines two Kubernetes clusters, my-h100-cluster and my-tpu-cluster, each with its own context of the same name.
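To make the kubeconfig structure concrete, here is a minimal Python sketch that reads context names and namespaces from an already-parsed kubeconfig. The dict literal stands in for the file above, and the helper names `context_names` and `context_namespace` are hypothetical; they are not part of SkyPilot or kubectl. A context without a namespace field falls back to Kubernetes' `default` namespace.

```python
# Stand-in for a parsed ~/.kube/config (only the fields used here).
# The names mirror the example kubeconfig; the dict itself is illustrative.
kubeconfig = {
    "contexts": [
        {
            "name": "my-h100-cluster",
            "context": {"cluster": "my-h100-cluster", "user": "my-h100-cluster"},
        },
        {
            "name": "my-tpu-cluster",
            "context": {
                "cluster": "my-tpu-cluster",
                "namespace": "my-namespace",
                "user": "my-tpu-cluster",
            },
        },
    ],
    "current-context": "my-h100-cluster",
}


def context_names(config):
    """Return context names in the order they appear in the kubeconfig."""
    return [entry["name"] for entry in config.get("contexts", [])]


def context_namespace(config, name):
    """Return the namespace of a context, or 'default' if none is set."""
    for entry in config.get("contexts", []):
        if entry["name"] == name:
            return entry["context"].get("namespace", "default")
    raise KeyError(f"context not found: {name}")


print(context_names(kubeconfig))
print(context_namespace(kubeconfig, "my-tpu-cluster"))
```

On a real machine, `kubectl config get-contexts` lists the same information directly.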

Step 2: Set up SkyPilot to access multiple Kubernetes clusters#

Unlike with clouds, SkyPilot does not fail over across different Kubernetes clusters (regions) by default, because each Kubernetes cluster can serve a different purpose.

By default, SkyPilot uses only the context named by current-context in the kubeconfig. You can print the current context with kubectl config current-context.

To allow SkyPilot to access multiple Kubernetes clusters, set kubernetes.allowed_contexts in the SkyPilot global config, ~/.sky/config.yaml:

kubernetes:
  allowed_contexts:
    - my-h100-cluster
    - my-tpu-cluster
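The selection rule described here can be sketched in a few lines of Python. This models the documented behavior only and is not SkyPilot's implementation; `contexts_to_use` is a hypothetical helper, and both dicts stand in for the parsed kubeconfig and ~/.sky/config.yaml.

```python
def contexts_to_use(kubeconfig, sky_config):
    """Model of the documented rule: use kubernetes.allowed_contexts when it
    is set, otherwise fall back to the kubeconfig's current-context only."""
    allowed = sky_config.get("kubernetes", {}).get("allowed_contexts")
    if allowed:
        # Preserve the order in which the contexts are listed.
        return list(allowed)
    current = kubeconfig.get("current-context")
    return [current] if current else []


# Illustrative inputs mirroring the examples in this section.
kubeconfig = {"current-context": "my-h100-cluster"}
sky_config = {
    "kubernetes": {"allowed_contexts": ["my-h100-cluster", "my-tpu-cluster"]}
}

print(contexts_to_use(kubeconfig, sky_config))  # both allowed contexts
print(contexts_to_use(kubeconfig, {}))          # only the current context
```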

To check the enabled Kubernetes clusters, you can run sky check k8s.

$ sky check k8s

🎉 Enabled clouds 🎉
  ✔ Kubernetes
    Allowed contexts:
    ├── my-h100-cluster
    └── my-tpu-cluster

Failover across multiple Kubernetes clusters#

With kubernetes.allowed_contexts set, SkyPilot fails over across the Kubernetes clusters in the order they are listed in that field. The output below comes from a setup whose allowed contexts are my-h100-cluster-gke and my-h100-cluster-eks, in that order.

$ sky launch --gpus H100 --cloud k8s echo 'Hello World'

Considered resources (1 node):
------------------------------------------------------------------------------------------------------------
CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE           COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------
Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-gke   0.00          ✔
Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-eks   0.00
------------------------------------------------------------------------------------------------------------
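Ordered failover of this kind can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not SkyPilot's provisioning logic: `launch_with_failover` and the `try_launch` callback are hypothetical, and the "free GPUs" table is made-up data used only to drive the example.

```python
def launch_with_failover(contexts, try_launch):
    """Try each context in list order; return the first one where
    try_launch(context) succeeds, or None if every attempt fails."""
    for context in contexts:
        if try_launch(context):
            return context
    return None


# Made-up capacity data: the first cluster has no free H100s.
free_h100s = {"my-h100-cluster-gke": 0, "my-h100-cluster-eks": 4}
attempts = []


def try_launch(context):
    """Hypothetical launch attempt that needs one free H100."""
    attempts.append(context)
    return free_h100s.get(context, 0) >= 1


chosen = launch_with_failover(
    ["my-h100-cluster-gke", "my-h100-cluster-eks"], try_launch
)
print(chosen, attempts)
```

The first context is always attempted first, so placing the preferred cluster at the top of allowed_contexts determines where tasks land when capacity is available.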

Launching in a specific Kubernetes cluster#

SkyPilot uses the region field to denote a Kubernetes context. To target a specific Kubernetes cluster, pass its context name to --region.

$ # Launch in a specific Kubernetes cluster.
$ sky launch --cloud k8s --region my-tpu-cluster echo 'Hello World'

$ # Check the GPUs available in a Kubernetes cluster
$ sky show-gpus --cloud k8s --region my-h100-cluster

Kubernetes GPUs (Context: my-h100-cluster)
GPU    QTY_PER_NODE            TOTAL_GPUS  TOTAL_FREE_GPUS
H100   1, 2, 3, 4, 5, 6, 7, 8  8           8

Kubernetes per node GPU availability
NODE_NAME             GPU_NAME  TOTAL_GPUS  FREE_GPUS
my-h100-cluster-hbzn  H100      8           8
my-h100-cluster-w5x7  None      0           0

The same --region flag applies whether you are launching a SkyPilot cluster or a task: set it to the name of the context to launch in.
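For scripted use, the command line can be assembled programmatically. The sketch below builds the same `sky launch` invocation shown above, using only the `--cloud` and `--region` flags that appear in this section. `launch_command` is a hypothetical helper, and the check against a list of known contexts is an assumption added for safety, not something SkyPilot requires.

```python
import shlex


def launch_command(context, entrypoint, known_contexts):
    """Build the argument list for launching in a specific Kubernetes context.

    `known_contexts` is an illustrative pre-check, e.g. the context names
    found in your kubeconfig.
    """
    if context not in known_contexts:
        raise ValueError(f"unknown Kubernetes context: {context}")
    return ["sky", "launch", "--cloud", "k8s", "--region", context, *entrypoint]


args = launch_command(
    "my-tpu-cluster",
    ["echo", "Hello World"],
    known_contexts=["my-h100-cluster", "my-tpu-cluster"],
)
# Print a shell-safe rendering of the command without running it.
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args)` to execute the launch.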

Dynamically updating clusters to use#

You can configure SkyPilot to dynamically fetch Kubernetes cluster configs and enforce restrictions on which clusters are used. For details, refer to "Dynamically update Kubernetes contexts to use".