Multiple Kubernetes Clusters
SkyPilot allows you to manage dev pods, jobs and services across multiple Kubernetes clusters through a single pane of glass.
You may have multiple Kubernetes clusters for different:
Use cases: e.g., a production cluster and a development/testing cluster.
Regions or clouds: e.g., US and EU regions; or AWS and Lambda clouds.
Accelerators: e.g., NVIDIA H100 cluster and a Google TPU cluster.
Configurations: e.g., a small cluster for a single node and a large cluster for multiple nodes.
Kubernetes versions: e.g., to upgrade a cluster from Kubernetes 1.20 to 1.21, you may create a new Kubernetes cluster to avoid downtime or unexpected errors.
Configuration
Step 1: Set up credentials
To work with multiple Kubernetes clusters, their credentials must be set up as individual contexts in your local ~/.kube/config file.
To deploy new clusters and get their credentials, see Deployment Guides.
For example, a ~/.kube/config file may look like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    ...
    server: https://xx.xx.xx.xx:45819
  name: my-h100-cluster
- cluster:
    certificate-authority-data:
    ...
    server: https://yy.yy.yy.yy:45819
  name: my-tpu-cluster
contexts:
- context:
    cluster: my-h100-cluster
    user: my-h100-cluster
  name: my-h100-cluster
- context:
    cluster: my-tpu-cluster
    namespace: my-namespace
    user: my-tpu-cluster
  name: my-tpu-cluster
current-context: my-h100-cluster
...
In this example, we have two Kubernetes clusters, my-h100-cluster and my-tpu-cluster, each with its own context.
Step 2: Set up SkyPilot to access multiple Kubernetes clusters
Unlike with clouds, SkyPilot does not fail over across different Kubernetes clusters (regions) by default, because each Kubernetes cluster can serve a different purpose.
By default, SkyPilot only uses the context set as current-context in the kubeconfig. You can get the current context with kubectl config current-context.
To allow SkyPilot to access multiple Kubernetes clusters, set kubernetes.allowed_contexts in the SkyPilot global config, ~/.sky/config.yaml:
kubernetes:
  allowed_contexts:
    - my-h100-cluster
    - my-tpu-cluster
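If the kubeconfig holds many contexts, the fragment above can be generated rather than typed by hand. The following is a minimal sketch, not part of SkyPilot: the helper names (parse_context_names, render_allowed_contexts, list_kubeconfig_contexts) are hypothetical, and it assumes kubectl is on the PATH. It relies only on kubectl config get-contexts -o name, which prints one context name per line.

```python
import subprocess
from typing import List


def parse_context_names(kubectl_output: str) -> List[str]:
    """Split the output of `kubectl config get-contexts -o name` into names."""
    return [line.strip() for line in kubectl_output.splitlines() if line.strip()]


def render_allowed_contexts(contexts: List[str]) -> str:
    """Render a kubernetes.allowed_contexts fragment for ~/.sky/config.yaml.

    The order of `contexts` is preserved in the rendered list.
    """
    lines = ["kubernetes:", "  allowed_contexts:"]
    lines += [f"    - {name}" for name in contexts]
    return "\n".join(lines) + "\n"


def list_kubeconfig_contexts() -> List[str]:
    """Ask kubectl for the context names in the active kubeconfig.

    Assumes kubectl is installed; raises CalledProcessError if it fails.
    """
    result = subprocess.run(
        ["kubectl", "config", "get-contexts", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_context_names(result.stdout)


# Example (requires kubectl):
#   print(render_allowed_contexts(list_kubeconfig_contexts()), end="")
```

Review the generated list before saving it: remove contexts SkyPilot should not use, and reorder the rest if the order matters to you.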
To check the enabled Kubernetes clusters, you can run sky check k8s.
$ sky check k8s
🎉 Enabled clouds 🎉
✔ Kubernetes
Allowed contexts:
├── my-h100-cluster
└── my-tpu-cluster
To check the GPUs available in a Kubernetes cluster, you can run sky show-gpus --cloud k8s.
$ sky show-gpus --cloud k8s
Kubernetes GPUs
GPU   UTILIZATION
H100  16 of 16 free
A100  8 of 8 free
Context: my-h100-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H100  1, 2, 4, 8                16 of 16 free
Context: kind-skypilot
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
A100  1, 2, 4, 8                8 of 8 free
Kubernetes per-node GPU availability
CONTEXT          NODE                                          GPU   UTILIZATION
my-h100-cluster  gke-skypilotalpha-default-pool-ff931856-6uvd  -     0 of 0 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-1usy      H100  8 of 8 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-4rxa      H100  8 of 8 free
kind-skypilot    skypilot-control-plane                        A100  8 of 8 free
Failover across multiple Kubernetes clusters
With kubernetes.allowed_contexts set, SkyPilot fails over across the Kubernetes clusters in the order in which they are listed in that field.
$ sky launch --gpus H100 --cloud k8s echo 'Hello World'
Considered resources (1 node):
------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE           COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-gke   0.00          ✔
 Kubernetes   2CPU--8GB--1H100   2       8         H100:1         my-h100-cluster-eks   0.00
------------------------------------------------------------------------------------------------------------
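The ordering rule can be pictured as a first-fit scan over the allowed contexts. The sketch below is a conceptual model only, not SkyPilot's implementation: pick_context and the can_schedule callback are hypothetical names introduced for illustration, and how SkyPilot actually decides that a cluster cannot serve a request is not modeled here.

```python
from typing import Callable, Optional, Sequence


def pick_context(
    allowed_contexts: Sequence[str],
    can_schedule: Callable[[str], bool],
) -> Optional[str]:
    """Return the first context, in listed order, that accepts the request.

    `can_schedule` is a stand-in for whatever check decides that a cluster
    can serve the request; it is an assumption of this sketch.
    """
    for context in allowed_contexts:
        if can_schedule(context):
            return context
    # No listed context could take the request.
    return None
```

For instance, with the contexts listed as my-h100-cluster then my-tpu-cluster, a request that both clusters can serve goes to my-h100-cluster, and my-tpu-cluster is tried only if the first cannot take it. Reordering the list changes which cluster is tried first.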
Launching in a specific Kubernetes cluster
SkyPilot uses the region field to denote a Kubernetes context. You can target a specific Kubernetes cluster by passing its context name to --region.
$ # Launch in a specific Kubernetes cluster.
$ sky launch --cloud k8s --region my-tpu-cluster echo 'Hello World'
$ # Check the GPUs available in a Kubernetes cluster
$ sky show-gpus --cloud k8s --region my-h100-cluster
Kubernetes GPUs
Context: my-h100-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H100  1, 2, 4, 8                16 of 16 free
Kubernetes per-node GPU availability
CONTEXT          NODE                                          GPU   UTILIZATION
my-h100-cluster  gke-skypilotalpha-default-pool-ff931856-6uvd  -     0 of 0 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-1usy      H100  8 of 8 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-4rxa      H100  8 of 8 free
The same applies whenever you launch a SkyPilot cluster or task: pass the context name to --region to choose the Kubernetes cluster it runs in.
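When scripting launches against several clusters, it can help to build the command line in one place. The helper below is a small sketch under that assumption. sky_launch_argv is a hypothetical name, and it uses only the flags shown in this section (--cloud k8s, --region, --gpus).

```python
import subprocess
from typing import List, Optional, Sequence


def sky_launch_argv(
    context: str,
    entrypoint: Sequence[str],
    gpus: Optional[str] = None,
) -> List[str]:
    """Build a `sky launch` argument list pinned to one Kubernetes context.

    The context name is passed through --region, as described above.
    """
    argv = ["sky", "launch", "--cloud", "k8s", "--region", context]
    if gpus is not None:
        argv += ["--gpus", gpus]
    # The command to run goes last, one list element per shell word.
    argv += list(entrypoint)
    return argv


# Example (requires the sky CLI):
#   subprocess.run(sky_launch_argv("my-tpu-cluster", ["echo", "Hello World"]), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the entrypoint contains spaces.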
Dynamically updating clusters to use
You can configure SkyPilot to dynamically fetch Kubernetes cluster configs and enforce restrictions on which clusters are used. Refer to Dynamically update Kubernetes contexts to use for more.