Getting Started on Kubernetes#
Prerequisites#
To connect and use a Kubernetes cluster, SkyPilot needs:
An existing Kubernetes cluster running Kubernetes v1.20 or later.
A kubeconfig file containing access credentials and the namespace to be used.
Supported Kubernetes deployments:
Hosted Kubernetes services (EKS, GKE)
On-prem clusters (Kubeadm, Rancher, K3s)
Local development clusters (KinD, minikube)
In a typical workflow:
A cluster administrator sets up a Kubernetes cluster. Refer to admin guides for Kubernetes cluster setup for different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging).
Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig files containing their credentials (kube-context). SkyPilot reads this Kubeconfig file to communicate with the cluster.
Launching your first task#
Once your cluster administrator has set up a Kubernetes cluster and provided you with a kubeconfig file:
Make sure kubectl, socat and nc (netcat) are installed on your local machine.
$ # macOS
$ brew install kubectl socat netcat
$ # Linux (may have socat already installed)
$ sudo apt-get install kubectl socat netcat
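Before moving on, it can help to confirm that all three tools are actually on your PATH. The following is a minimal sketch; missing_tools is a hypothetical helper written for this guide, not a SkyPilot or kubectl command.

```shell
# Hypothetical helper (not part of SkyPilot): print the name of every
# tool passed as an argument that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example usage:
#   missing_tools kubectl socat nc
```

If the command prints nothing, all the listed tools were found. Any name it prints still needs to be installed.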
Place your kubeconfig file at ~/.kube/config.
$ mkdir -p ~/.kube
$ cp /path/to/kubeconfig ~/.kube/config
You can verify your credentials are set up correctly by running kubectl get pods.
Note
If your cluster administrator has also provided you with a specific service account to use, set it in your ~/.sky/config.yaml file:
kubernetes:
  remote_identity: your-service-account-name
Run sky check and verify that Kubernetes is enabled in SkyPilot.
$ sky check
Checking credentials to enable clouds for SkyPilot.
...
Kubernetes: enabled
...
Note
sky check will also check if GPU support is available on your cluster. If GPU support is not available, it will show the reason. To set up GPU support on the cluster, refer to the Kubernetes cluster setup guide.
You can now run any SkyPilot task on your Kubernetes cluster.
$ sky launch --cpus 2+ task.yaml
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour

Considered resources (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
---------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--2GB         2       2         -              kubernetes    0.00          ✔
 AWS          m6i.large         2       8         -              us-east-1     0.10
 Azure        Standard_D2s_v5   2       8         -              eastus        0.10
 GCP          n2-standard-2     2       8         -              us-central1   0.10
 IBM          bx2-8x32          8       32        -              us-east       0.38
 Lambda       gpu_1x_a10        30      200       A10:1          us-east-1     0.60
---------------------------------------------------------------------------------------------------
Note
SkyPilot will use the cluster and namespace set in the current-context in the kubeconfig file. To manage your current-context:
$ # See current context
$ kubectl config current-context
$ # Switch current-context
$ kubectl config use-context mycontext
$ # Set a specific namespace to be used in the current-context
$ kubectl config set-context --current --namespace=mynamespace
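To double-check which namespace the current-context points at, you can read it back from the kubeconfig. The sketch below assumes kubectl is on your PATH; current_namespace is a hypothetical helper name. It relies on kubectl's own behavior of falling back to the default namespace when the context does not set one.

```shell
# Hypothetical helper: print the namespace of the current kubeconfig context.
# If the context does not set a namespace, kubectl uses "default",
# so the helper prints "default" in that case.
current_namespace() {
  ns=$(kubectl config view --minify --output 'jsonpath={..namespace}')
  echo "${ns:-default}"
}

# Example usage:
#   current_namespace
```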
Viewing cluster status#
To view the status of all SkyPilot resources in the Kubernetes cluster, run sky status --k8s.
Unlike sky status, which lists only the SkyPilot resources launched by the current user, sky status --k8s lists all SkyPilot resources in the Kubernetes cluster across all users.
$ sky status --k8s
Kubernetes cluster state (context: mycluster)

SkyPilot clusters
USER   NAME                           LAUNCHED    RESOURCES                                  STATUS
alice  infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
alice  sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
alice  sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
bob    dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
bob    multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
bob    sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP

Managed jobs
In progress tasks: 1 STARTING
USER   ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
alice  1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
bob    4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
bob    3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
bob    2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
bob    1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED
You can also inspect the real-time GPU usage on the cluster with sky show-gpus --cloud k8s.
$ sky show-gpus --cloud k8s
Kubernetes GPUs
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
L4    1, 2, 4                   12          12
H100  1, 2, 4, 8                16          16

Kubernetes per node GPU availability
NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
my-cluster-0  L4        4           4
my-cluster-1  L4        4           4
my-cluster-2  L4        2           2
my-cluster-3  L4        2           2
my-cluster-4  H100      8           8
my-cluster-5  H100      8           8
Using Custom Images#
By default, we maintain and use two SkyPilot container images for use on Kubernetes clusters:
us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot: used for CPU-only clusters (Dockerfile).
us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu: used for GPU clusters (Dockerfile).
These images are pre-installed with SkyPilot dependencies for fast startup.
To use your own image, add image_id: docker:<your image tag> to the resources section of your task YAML.
resources:
  image_id: docker:myrepo/myimage:latest
  ...
Your image must satisfy the following requirements:
The image must be Debian-based and must have the apt package manager installed.
The default user in the image must have root privileges or passwordless sudo access.
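One way to sanity-check an image against these two requirements before launching is to run a short probe inside it. This is only a rough sketch: check_image is a hypothetical helper, it assumes Docker is installed on your local machine, and passing the probe does not guarantee the image will work with SkyPilot.

```shell
# Hypothetical helper (assumes a local Docker installation).
# Runs a throwaway container from the given image and checks that:
#   1. apt is available, and
#   2. the default user is root or can run sudo without a password.
check_image() {
  docker run --rm --entrypoint sh "$1" -c '
    command -v apt >/dev/null 2>&1 || { echo "apt not found"; exit 1; }
    [ "$(id -u)" -eq 0 ] || sudo -n true 2>/dev/null || { echo "no root or passwordless sudo"; exit 1; }
    echo "image looks compatible"
  '
}

# Example usage:
#   check_image myrepo/myimage:latest
```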
Note
If your cluster runs on a non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at Start streaming logs .... See GitHub issue for more.
Using Images from Private Repositories#
To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a secret in your Kubernetes cluster and edit your ~/.sky/config.yaml to specify the secret like so:
kubernetes:
  pod_config:
    spec:
      imagePullSecrets:
        - name: your-secret-here
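The secret itself can be created with kubectl's docker-registry secret type. The sketch below wraps that command in a hypothetical helper, create_pull_secret. The registry server, username and password are placeholders you must supply, and the secret is created in the namespace of your current-context, which should be the namespace SkyPilot runs in.

```shell
# Hypothetical helper: create an image pull secret in the current namespace.
# Usage: create_pull_secret NAME SERVER USERNAME PASSWORD
create_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server="$2" \
    --docker-username="$3" \
    --docker-password="$4"
}

# Example usage (placeholder values):
#   create_pull_secret your-secret-here registry.example.com myuser "$REGISTRY_PASSWORD"
```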
Tip
If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using k8s-ecr-login-renew to automatically refresh your secrets.
Opening Ports#
Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes:
LoadBalancer services (default)
One of these modes must be supported and configured on your cluster. Refer to the setting up ports on Kubernetes guide on how to do this.
Tip
On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed.
Once your cluster is configured, launch a task which exposes services on a port by adding ports to the resources section of your task YAML.
# task.yaml
resources:
  ports: 8888
run: |
  python -m http.server 8888
After launching the cluster with sky launch -c myclus task.yaml, you can get the URL to access the port using sky status --endpoints myclus.
# List all ports exposed by the cluster
$ sky status --endpoints myclus
8888: 34.173.13.241:8888
# curl a specific port's endpoint
$ curl $(sky status --endpoint 8888 myclus)
...
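An endpoint may take a little while to start responding after launch, for example while a LoadBalancer is still being provisioned. A small retry loop around curl can smooth this over. This is a sketch: wait_for_endpoint is a hypothetical helper, and the default retry count and delay are arbitrary assumptions.

```shell
# Hypothetical helper: poll a URL until it responds successfully.
# Usage: wait_for_endpoint URL [RETRIES] [DELAY_SECONDS]
# Returns 0 as soon as curl succeeds, 1 if all retries are exhausted.
wait_for_endpoint() {
  url=$1
  retries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$retries" ]; do
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage:
#   wait_for_endpoint "$(sky status --endpoint 8888 myclus)" && echo "endpoint is up"
```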
Tip
To learn more about opening ports in SkyPilot tasks, see Opening Ports.
FAQs#
Can I use multiple Kubernetes clusters with SkyPilot?
SkyPilot can work with multiple Kubernetes contexts set in your kubeconfig file. By default, SkyPilot will use the current active context. To use a different context, change your current context using kubectl config use-context <context-name>.
If you would like to use multiple contexts seamlessly during failover, check out the allowed_contexts feature in Advanced Configurations.
Are autoscaling Kubernetes clusters supported?
To run on autoscaling clusters, set the provision_timeout key in ~/.sky/config.yaml to a large value to give enough time for the cluster autoscaler to provision new nodes. This will direct SkyPilot to wait for the cluster to scale up before failing over to the next candidate resource (e.g., next cloud).
If you are using GPUs in a scale-to-zero setting, you should also set the autoscaler key to the autoscaler type of your cluster. More details in Advanced Configurations.
# ~/.sky/config.yaml
kubernetes:
  provision_timeout: 900  # Wait 15 minutes for nodes to get provisioned before failover. Set to -1 to wait indefinitely.
  autoscaler: gke  # [gke, karpenter, generic]; required if using GPUs in scale-to-zero setting
Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?
The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster.
I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?
For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. SkyPilot will use the namespace set in the kubeconfig for running all tasks.
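As a rough sketch of what an administrator might run for each user: create the namespace, then pin it in the kubeconfig file that will be handed to that user. assign_namespace is a hypothetical helper. It assumes the administrator's own kubeconfig can create namespaces and that the user's kubeconfig file already exists with a current-context. Note that a namespace alone does not restrict access; any RBAC rules must be set up separately.

```shell
# Hypothetical helper: create a namespace and set it as the namespace of the
# current-context inside a user's kubeconfig file.
# Usage: assign_namespace NAMESPACE USER_KUBECONFIG_FILE
assign_namespace() {
  kubectl create namespace "$1" &&
    kubectl --kubeconfig="$2" config set-context --current --namespace="$1"
}

# Example usage (placeholder values):
#   assign_namespace team-alice /path/to/alice-kubeconfig
```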
How do I view the pods created by SkyPilot on my Kubernetes cluster?
You can use your existing observability tools to filter resources with the label parent=skypilot (kubectl get pods -l 'parent=skypilot'). As an example, follow the instructions here to deploy the Kubernetes Dashboard on your cluster.
How can I specify custom configuration for the pods created by SkyPilot?
You can override the pod configuration used by SkyPilot by setting the pod_config key in ~/.sky/config.yaml. The value of pod_config should be a dictionary that follows the Kubernetes Pod API.
For example, to set custom environment variables and attach a volume on your pods, you can add the following to your ~/.sky/config.yaml file:
kubernetes:
  pod_config:
    spec:
      containers:
        - env:
            - name: MY_ENV_VAR
              value: MY_ENV_VALUE
          volumeMounts:  # Custom volume mounts for the pod
            - mountPath: /foo
              name: example-volume
          resources:  # Custom resource requests and limits
            requests:
              rdma/rdma_shared_device_a: 1
            limits:
              rdma/rdma_shared_device_a: 1
      volumes:
        - name: example-volume
          hostPath:
            path: /tmp
            type: Directory
For more details refer to Advanced Configurations.
I am using a custom image. How can I speed up the pod startup time?
You can pre-install SkyPilot dependencies in your custom image to speed up the pod startup time. Simply add these lines at the end of your Dockerfile:
FROM <your base image>

# Install system dependencies
RUN apt update -y && \
    apt install git gcc rsync sudo patch openssh-server pciutils fuse unzip socat netcat-openbsd curl -y && \
    rm -rf /var/lib/apt/lists/*

# Install conda and other python dependencies
RUN curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh -o Miniconda3-Linux-x86_64.sh && \
    bash Miniconda3-Linux-x86_64.sh -b && \
    eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true && conda activate base && \
    grep "# >>> conda initialize >>>" ~/.bashrc || { conda init && source ~/.bashrc; } && \
    rm Miniconda3-Linux-x86_64.sh && \
    export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
    python3 -m venv ~/skypilot-runtime && \
    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
    curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc