Multiple Kubernetes Clusters#
SkyPilot allows you to manage dev pods, jobs and services across multiple Kubernetes clusters through a single pane of glass.
You may have multiple Kubernetes clusters for different:
Use cases: e.g., a production cluster and a development/testing cluster.
Regions or clouds: e.g., US and EU regions; or AWS and Lambda clouds.
Accelerators: e.g., NVIDIA H100 cluster and a Google TPU cluster.
Configurations: e.g., a small cluster for a single node and a large cluster for multiple nodes.
Kubernetes versions: e.g., to upgrade a cluster from Kubernetes 1.20 to 1.21, you may create a new Kubernetes cluster to avoid downtime or unexpected errors.
Configuration#
Step 1: Set up credentials#
To work with multiple Kubernetes clusters, their credentials must be set up as individual contexts in your local ~/.kube/config file.
For deploying new clusters and getting credentials, see Deployment Guides.
For example, a ~/.kube/config file may look like this:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    ...
    server: https://xx.xx.xx.xx:45819
  name: my-h100-cluster
- cluster:
    certificate-authority-data:
    ...
    server: https://yy.yy.yy.yy:45819
  name: my-tpu-cluster
contexts:
- context:
    cluster: my-h100-cluster
    user: my-h100-cluster
  name: my-h100-cluster
- context:
    cluster: my-tpu-cluster
    namespace: my-namespace
    user: my-tpu-cluster
  name: my-tpu-cluster
current-context: my-h100-cluster
...
In this example, there are two Kubernetes clusters, my-h100-cluster and my-tpu-cluster, and each has its own context of the same name.
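As a minimal sketch of how the contexts in such a file relate to clusters and namespaces, the snippet below walks a kubeconfig that has already been parsed into a Python dict (for example, from the output of kubectl config view -o json) and lists each context. The KUBECONFIG dict and the list_contexts helper are illustrative assumptions that mirror the example above; they are not part of SkyPilot.

```python
from typing import List, Tuple

# Hypothetical, already-parsed kubeconfig mirroring the example above.
KUBECONFIG = {
    "contexts": [
        {"name": "my-h100-cluster",
         "context": {"cluster": "my-h100-cluster",
                     "user": "my-h100-cluster"}},
        {"name": "my-tpu-cluster",
         "context": {"cluster": "my-tpu-cluster",
                     "namespace": "my-namespace",
                     "user": "my-tpu-cluster"}},
    ],
    "current-context": "my-h100-cluster",
}


def list_contexts(kubeconfig: dict) -> List[Tuple[str, str, str]]:
    """Return (context name, cluster, namespace) for every context."""
    rows = []
    for entry in kubeconfig.get("contexts", []):
        ctx = entry.get("context", {})
        # kubectl falls back to the "default" namespace when none is set.
        rows.append((entry["name"],
                     ctx.get("cluster", ""),
                     ctx.get("namespace", "default")))
    return rows


if __name__ == "__main__":
    current = KUBECONFIG["current-context"]
    for name, cluster, namespace in list_contexts(KUBECONFIG):
        # Mark the context that kubectl (and SkyPilot, by default) would use.
        marker = "*" if name == current else " "
        print(f"{marker} {name}  cluster={cluster}  namespace={namespace}")
```

The context names in the first column are the values that the rest of this guide refers to.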
Step 2: Set up SkyPilot to access multiple Kubernetes clusters#
Unlike with clouds, SkyPilot does not fail over across different Kubernetes clusters (regions) by default, because each Kubernetes cluster can serve a different purpose.
By default, SkyPilot uses only the context set as current-context in the
kubeconfig. You can print the current context with kubectl config
current-context.
To allow SkyPilot to access multiple Kubernetes clusters, set
kubernetes.allowed_contexts in the SkyPilot global config, ~/.sky/config.yaml.
kubernetes:
  allowed_contexts:
    - my-h100-cluster
    - my-tpu-cluster
To check the enabled Kubernetes clusters, you can run sky check k8s.
$ sky check k8s
🎉 Enabled clouds 🎉
  ✔ Kubernetes
    Allowed contexts:
    ├── my-h100-cluster
    └── my-tpu-cluster
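Because allowed_contexts refers to contexts by name, a typo in the list is easy to miss. The following is a small sanity-check sketch, not a SkyPilot feature: it compares a list of allowed contexts against the context names defined in a kubeconfig. Both lists are hard-coded, hypothetical values taken from the examples above, and missing_contexts is an illustrative helper.

```python
from typing import Iterable, List


def missing_contexts(allowed: Iterable[str],
                     defined: Iterable[str]) -> List[str]:
    """Return allowed contexts that the kubeconfig does not define, in order."""
    defined_set = set(defined)
    return [name for name in allowed if name not in defined_set]


# Hypothetical values mirroring the examples in this guide.
allowed_contexts = ["my-h100-cluster", "my-tpu-cluster"]
kubeconfig_contexts = ["my-h100-cluster", "my-tpu-cluster"]

missing = missing_contexts(allowed_contexts, kubeconfig_contexts)
if missing:
    print("Not found in kubeconfig:", ", ".join(missing))
else:
    print("All allowed contexts are defined in the kubeconfig.")
```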
To check GPUs available in a Kubernetes cluster, you can run sky show-gpus --infra k8s.
$ sky show-gpus --infra k8s
Kubernetes GPUs
GPU   UTILIZATION
H100  16 of 16 free
A100  8 of 8 free

Context: my-h100-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H100  1, 2, 4, 8                16 of 16 free

Context: kind-skypilot
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
A100  1, 2, 4, 8                8 of 8 free

Kubernetes per-node GPU availability
CONTEXT          NODE                                          GPU   UTILIZATION
my-h100-cluster  gke-skypilotalpha-default-pool-ff931856-6uvd  -     0 of 0 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-1usy      H100  8 of 8 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-4rxa      H100  8 of 8 free
kind-skypilot    skypilot-control-plane                        A100  8 of 8 free
Failover across multiple Kubernetes clusters#
With kubernetes.allowed_contexts set, SkyPilot fails over across the
listed Kubernetes clusters in the order in which they appear in that field.
$ sky launch --gpus H100 --infra k8s echo 'Hello World'
Considered resources (1 node):
---------------------------------------------------------------------------------------------------------
INFRA                         INSTANCE        vCPUs   Mem(GB)   GPUS   COST ($)   CHOSEN
---------------------------------------------------------------------------------------------------------
Kubernetes (my-eks-cluster)   2CPU--2GB       2       2         -      0.00       ✔
Kubernetes (gke-skypilot)     4CPU--8GB       4       8         -      0.00
AWS (us-east-1)               m6i.large       2       8         -      0.10
GCP (us-central1-a)           n2-standard-2   2       8         -      0.10
---------------------------------------------------------------------------------------------------------
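The ordering rule can be pictured with a short, self-contained simulation. This is only a conceptual sketch: try_launch is a hypothetical stand-in for a provisioning attempt, the capacity map is made up, and the code does not reflect SkyPilot's internal implementation.

```python
from typing import Callable, List, Optional


def first_available(contexts: List[str],
                    try_launch: Callable[[str], bool]) -> Optional[str]:
    """Try each context in the listed order; return the first that succeeds."""
    for context in contexts:
        if try_launch(context):
            return context
    return None  # every context was tried and none succeeded


# Hypothetical capacity map: the first cluster is full, the second is not.
has_capacity = {"my-h100-cluster": False, "my-tpu-cluster": True}
allowed_contexts = ["my-h100-cluster", "my-tpu-cluster"]

chosen = first_available(allowed_contexts, lambda ctx: has_capacity[ctx])
print(chosen)  # prints "my-tpu-cluster"
```

Under this reading, reordering the entries in allowed_contexts changes which cluster is tried first.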
Launching in a specific Kubernetes cluster#
SkyPilot uses the infra field to denote a Kubernetes context. To target a specific Kubernetes cluster,
pass --infra k8s/<context-name>, where <context-name> is the name of the context for that cluster.
$ # Launch in a specific Kubernetes cluster.
$ sky launch --infra k8s/my-tpu-cluster echo 'Hello World'
$ # Check the GPUs available in a Kubernetes cluster
$ sky show-gpus --infra k8s/my-h100-cluster
Kubernetes GPUs
Context: my-h100-cluster
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
H100  1, 2, 4, 8                16 of 16 free

Kubernetes per-node GPU availability
CONTEXT          NODE                                          GPU   UTILIZATION
my-h100-cluster  gke-skypilotalpha-default-pool-ff931856-6uvd  -     0 of 0 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-1usy      H100  8 of 8 free
my-h100-cluster  gke-skypilotalpha-largecpu-05dae726-4rxa      H100  8 of 8 free
The same --infra value applies whenever you launch a SkyPilot cluster or task: it selects the Kubernetes context in which that cluster or task runs.
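For scripted launches, the command line can be assembled programmatically. The sketch below only builds and prints the argument list using the flags shown in this guide; sky_launch_args is a hypothetical helper, and nothing is executed.

```python
import shlex
from typing import List, Optional


def sky_launch_args(entrypoint: List[str],
                    context: Optional[str] = None) -> List[str]:
    """Build argv for `sky launch`, optionally pinned to one Kubernetes context."""
    # Without a context, fall back to plain "k8s" as used earlier in this guide.
    infra = f"k8s/{context}" if context else "k8s"
    return ["sky", "launch", "--infra", infra, *entrypoint]


args = sky_launch_args(["echo", "Hello World"], context="my-tpu-cluster")
print(shlex.join(args))
# prints: sky launch --infra k8s/my-tpu-cluster echo 'Hello World'
```

Passing the resulting list to a process runner such as subprocess.run would invoke the CLI; that step is omitted here so the example has no side effects.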
Per-context configuration#
When using multiple Kubernetes clusters, SkyPilot allows you to configure different settings for each Kubernetes context using the kubernetes.context_configs field in your ~/.sky/config.yaml.
This is useful when different clusters have different config requirements or capabilities.
You can specify per-context configurations for any Kubernetes config field, including:
pod_config: Custom pod specifications (labels, annotations, volume mounts, runtime class, etc.)
remote_identity: Service account to use for the context
provision_timeout: Timeout for provisioning pods if autoscaler is used
See Kubernetes config for the list of all fields supported.
Example configuration:
kubernetes:
  # Global defaults for all contexts
  provision_timeout: 10
  allowed_contexts:
    - my-h100-cluster
    - dev-cluster
  # Context-specific configurations
  context_configs:
    my-h100-cluster:
      # Label pods launched on the H100 cluster
      pod_config:
        metadata:
          labels:
            cluster-type: production
      remote_identity: h100-service-account  # Use a custom service account for the cluster
    # Development cluster with different proxy settings and volume mounts
    dev-cluster:
      pod_config:
        spec:
          imagePullSecrets:
            - name: my-secret
          containers:
            - env:
                - name: HTTP_PROXY
                  value: http://proxy-host:3128
              volumeMounts:
                - mountPath: /foo
                  name: example-volume
                  readOnly: true
          volumes:
            - name: example-volume
              hostPath:
                path: /tmp
                type: Directory
      provision_timeout: 3600  # Large timeout for autoscaler to provision nodes
      autoscaler: gke
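One way to read this example is to treat each context's effective settings as the global defaults with that context's entries layered on top. The sketch below illustrates that reading with plain dictionaries. The shallow-override rule and the effective_config helper are assumptions made for illustration; SkyPilot's actual merge behavior, particularly for nested fields such as pod_config, may differ.

```python
from typing import Any, Dict

# Hypothetical, trimmed-down view of the `kubernetes` section above.
KUBERNETES_CONFIG: Dict[str, Any] = {
    "provision_timeout": 10,
    "allowed_contexts": ["my-h100-cluster", "dev-cluster"],
    "context_configs": {
        "my-h100-cluster": {"remote_identity": "h100-service-account"},
        "dev-cluster": {"provision_timeout": 3600, "autoscaler": "gke"},
    },
}


def effective_config(config: Dict[str, Any], context: str) -> Dict[str, Any]:
    """Global defaults overlaid with one context's overrides (assumed rule)."""
    # Keys that describe the set of contexts are not per-context settings.
    defaults = {key: value for key, value in config.items()
                if key not in ("context_configs", "allowed_contexts")}
    overrides = config.get("context_configs", {}).get(context, {})
    # Later keys win, so context-specific values replace the defaults.
    return {**defaults, **overrides}


for ctx in KUBERNETES_CONFIG["allowed_contexts"]:
    print(ctx, effective_config(KUBERNETES_CONFIG, ctx))
```

Under this assumption, dev-cluster ends up with a provision_timeout of 3600, while my-h100-cluster keeps the global value of 10.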
Dynamically updating clusters to use#
You can configure SkyPilot to dynamically fetch Kubernetes cluster configs and enforce restrictions on which clusters are used. For details, refer to Dynamically update Kubernetes contexts to use.