Multiple Kubernetes Clusters#
SkyPilot allows you to manage dev pods, jobs and services across multiple Kubernetes clusters through a single pane of glass.
You may have multiple Kubernetes clusters for different:
Use cases: e.g., a production cluster and a development/testing cluster.
Regions or clouds: e.g., US and EU regions; or AWS and Lambda clouds.
Accelerators: e.g., NVIDIA H100 cluster and a Google TPU cluster.
Configurations: e.g., a small cluster for a single node and a large cluster for multiple nodes.
Kubernetes versions: e.g., to upgrade a cluster from Kubernetes 1.20 to 1.21, you may create a new Kubernetes cluster to avoid downtime or unexpected errors.
Configuration#
Step 1: Set up credentials#
To work with multiple Kubernetes clusters, their credentials must be set up as individual contexts in your local ~/.kube/config file.
For deploying new clusters and getting credentials, see Deployment Guides.
For example, a ~/.kube/config file may look like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    ...
    server: https://xx.xx.xx.xx:45819
  name: my-h100-cluster
- cluster:
    certificate-authority-data:
    ...
    server: https://yy.yy.yy.yy:45819
  name: my-tpu-cluster
contexts:
- context:
    cluster: my-h100-cluster
    user: my-h100-cluster
  name: my-h100-cluster
- context:
    cluster: my-tpu-cluster
    namespace: my-namespace
    user: my-tpu-cluster
  name: my-tpu-cluster
current-context: my-h100-cluster
...
In this example, we have two Kubernetes clusters, my-h100-cluster and my-tpu-cluster, and each Kubernetes cluster has its own context.
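To confirm which contexts exist before pointing SkyPilot at them, the names can be read from kubectl. The following is a minimal Python sketch, assuming kubectl is installed and on the PATH. The helper names parse_context_names and list_kube_contexts are hypothetical and are not part of SkyPilot.

```python
import subprocess


def parse_context_names(output: str) -> list[str]:
    """Turn the output of `kubectl config get-contexts -o name` into a list."""
    # The command prints one context name per line; skip blank lines.
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_kube_contexts() -> list[str]:
    """Return the context names kubectl sees in the active kubeconfig."""
    # Assumes kubectl is installed; raises CalledProcessError if the call fails.
    result = subprocess.run(
        ["kubectl", "config", "get-contexts", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_context_names(result.stdout)
```

Calling list_kube_contexts() on a machine with the example kubeconfig should include both my-h100-cluster and my-tpu-cluster in its result.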
Step 2: Set up SkyPilot to access multiple Kubernetes clusters#
Unlike with clouds, SkyPilot does not fail over across different Kubernetes clusters (regions) by default, because each Kubernetes cluster can serve a different purpose.
By default, SkyPilot only uses the context set as current-context in the kubeconfig. You can get the current context with kubectl config current-context.
To allow SkyPilot to access multiple Kubernetes clusters, set kubernetes.allowed_contexts in the SkyPilot global config, ~/.sky/config.yaml.
kubernetes:
  allowed_contexts:
    - my-h100-cluster
    - my-tpu-cluster
To check the enabled Kubernetes clusters, run sky check k8s.
$ sky check k8s
🎉 Enabled clouds 🎉
✔ Kubernetes
Allowed contexts:
├── my-h100-cluster
└── my-tpu-cluster
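A typo in kubernetes.allowed_contexts is easy to miss, so it can help to compare the list against the contexts that actually exist in the kubeconfig. Below is a minimal sketch, assuming both lists have already been loaded into Python (for example, from the two YAML files). find_missing_contexts is a hypothetical helper, not a SkyPilot API.

```python
def find_missing_contexts(allowed: list[str], available: list[str]) -> list[str]:
    """Return allowed contexts that are absent from the kubeconfig, in order."""
    known = set(available)
    return [name for name in allowed if name not in known]


# Values taken from the examples in this guide.
allowed_contexts = ["my-h100-cluster", "my-tpu-cluster"]
kubeconfig_contexts = ["my-h100-cluster", "my-tpu-cluster"]

missing = find_missing_contexts(allowed_contexts, kubeconfig_contexts)
print(missing or "all allowed contexts are present in the kubeconfig")
```

An empty result means every allowed context has a matching kubeconfig entry. Any names returned are worth checking for spelling before running sky check k8s again.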
Failover across multiple Kubernetes clusters#
With kubernetes.allowed_contexts set, SkyPilot fails over through the Kubernetes clusters in the order in which they are listed in the field.
$ sky launch --gpus H100 --cloud k8s echo 'Hello World'
Considered resources (1 node):
------------------------------------------------------------------------------------------------------------
CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE           COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------
Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-gke   0.00          ✔
Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-eks   0.00
------------------------------------------------------------------------------------------------------------
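The ordered failover can be pictured as a first-fit scan over allowed_contexts. The sketch below is only an illustrative model, not SkyPilot's actual implementation. pick_context and the can_schedule callback are hypothetical names, and the free-GPU numbers are made up for the example.

```python
from typing import Callable, Optional


def pick_context(
    allowed_contexts: list[str],
    can_schedule: Callable[[str], bool],
) -> Optional[str]:
    """Return the first context, in listed order, that can run the task."""
    for context in allowed_contexts:
        if can_schedule(context):
            return context
    # Every context was tried and none had capacity.
    return None


# Hypothetical free H100 counts per context (illustrative numbers only).
free_h100s = {"my-h100-cluster-gke": 0, "my-h100-cluster-eks": 4}
order = ["my-h100-cluster-gke", "my-h100-cluster-eks"]

chosen = pick_context(order, lambda ctx: free_h100s.get(ctx, 0) >= 1)
print(chosen)  # my-h100-cluster-eks
```

In this model the first listed context is skipped only because it has no free GPU. Listing the preferred cluster first in allowed_contexts is therefore what gives it priority.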
Launching in a specific Kubernetes cluster#
SkyPilot uses the region field to denote a Kubernetes context. You can point to a Kubernetes cluster by passing the context name for that cluster to --region.
$ # Launch in a specific Kubernetes cluster.
$ sky launch --cloud k8s --region my-tpu-cluster echo 'Hello World'
$ # Check the GPUs available in a Kubernetes cluster
$ sky show-gpus --cloud k8s --region my-h100-cluster
Kubernetes GPUs (Context: my-h100-cluster)
GPU   QTY_PER_NODE            TOTAL_GPUS  TOTAL_FREE_GPUS
H100  1, 2, 3, 4, 5, 6, 7, 8  8           8

Kubernetes per node GPU availability
NODE_NAME             GPU_NAME  TOTAL_GPUS  FREE_GPUS
my-h100-cluster-hbzn  H100      8           8
my-h100-cluster-w5x7  None      0           0
When launching a SkyPilot cluster or task, pass the context name to --region to choose the Kubernetes cluster it runs in.
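When scripting launches against several clusters, the only part of the command that changes is the context passed to --region. Here is a small sketch that assembles the sky launch invocation from this guide for a given context. build_launch_command is a hypothetical helper, and actually running the resulting command assumes the sky CLI is installed.

```python
import shlex


def build_launch_command(context: str, command: str) -> list[str]:
    """Assemble `sky launch` arguments that target one Kubernetes context."""
    # --cloud k8s and --region <context> are the flags used in this guide.
    return [
        "sky", "launch",
        "--cloud", "k8s",
        "--region", context,
        *shlex.split(command),
    ]


args = build_launch_command("my-tpu-cluster", "echo hello")
# shlex.join renders the argument list as a single shell-safe line.
print(shlex.join(args))
```

The argument list can be handed to subprocess.run as-is, which avoids shell quoting problems when a context name or command contains unusual characters.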
Dynamically updating clusters to use#
You can configure SkyPilot to dynamically fetch Kubernetes cluster configs and enforce restrictions on which clusters are used. Refer to Dynamically update Kubernetes contexts to use for more.