Kubernetes Troubleshooting#
If you’re unable to run SkyPilot tasks on your Kubernetes cluster, this guide will help you debug common issues.
If this guide does not help resolve your issue, please reach out to us on Slack or GitHub.
Verifying basic setup#
Step A0 - Is Kubectl functional?#
Are you able to run `kubectl get nodes` without any errors?
$ kubectl get nodes
# This should list all the nodes in your cluster.
Make sure at least one node is in the `Ready` state.
If you see an error, ensure that your kubeconfig file at `~/.kube/config` is correctly set up.
Note
The `kubectl` command should not require any additional flags or environment variables to run.
If it requires additional flags, you must encode all configuration in your kubeconfig file at `~/.kube/config`.
For example, `--context`, `--token`, `--certificate-authority`, etc. should all be configured directly in the kubeconfig file.
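If your current setup relies on such flags, you can fold them into `~/.kube/config` with `kubectl config`. A minimal sketch, where all names, paths, and values below are placeholders to replace with your own:

```shell
# Register the cluster, credentials, and a context in ~/.kube/config.
# mycluster, myuser, mycontext, and all bracketed values are placeholders.
kubectl config set-cluster mycluster \
  --server=https://<api-server-address>:6443 \
  --certificate-authority=/path/to/ca.crt \
  --embed-certs=true
kubectl config set-credentials myuser --token=<your-token>
kubectl config set-context mycontext --cluster=mycluster --user=myuser

# Make it the default context so kubectl runs without extra flags
kubectl config use-context mycontext
```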
Step A1 - Can you create pods and services?#
As a sanity check, we will now try creating a simple pod running an HTTP server, along with a service, to verify that your cluster and its networking are functional.
We will use the SkyPilot default image `us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest` to verify that the image can be pulled from the registry.
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
# Verify that the pod is running by checking the status of the pod
$ kubectl get pod skytest
# Try accessing the HTTP server in the pod by port-forwarding it to your local machine
$ kubectl port-forward svc/skytest-svc 8080:8080
# Open a browser and navigate to http://localhost:8080 to see an index page
# Once you have verified that the pod is running, you can delete it
$ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
If your pod does not start, check the pod's events and logs for errors with `kubectl describe pod skytest` and `kubectl logs skytest`.
Step A2 - Can SkyPilot access your cluster?#
Run `sky check` to verify that SkyPilot can access your cluster.
$ sky check
# Should show `Kubernetes: Enabled`
If you see an error, ensure that your kubeconfig file at `~/.kube/config` is correctly set up.
Step A3 - Do your nodes have enough disk space?#
If your nodes are out of disk space, pulling the SkyPilot images may fail with an `rpc error: code = Canceled desc = failed to pull and unpack image: context canceled` error in the terminal during provisioning.
Make sure your nodes are not under disk pressure by checking `Conditions` in the output of `kubectl describe nodes`, or by running:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[?(@.type=="DiskPressure")]}{.type}={.status}{"\n"}{end}{"\n"}{end}'
# Should not show DiskPressure=True for any node
Step A4 - Can you launch a SkyPilot task?#
Next, try running a simple hello world task to verify that SkyPilot can launch tasks on your cluster.
$ sky launch -y -c mycluster --cloud k8s -- "echo hello world"
# Task should run and print "hello world" to the console
# Once you have verified that the task runs, you can delete it
$ sky down -y mycluster
If your task does not run, check the terminal and the provisioning logs for errors. The path to the provisioning logs is printed at the start of the SkyPilot output, on the line beginning with “To view detailed progress: …”.
Checking GPU support#
If you are trying to run a GPU task, make sure you have followed the instructions in Step 2 - Set up GPU support to set up your cluster for GPU support.
In this section, we will verify that your cluster has GPU support and that SkyPilot can access it.
Step B0 - Is your cluster GPU-enabled?#
Run `kubectl describe nodes` or the snippet below to verify that your nodes have `nvidia.com/gpu` resources.
$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity}'
# Look for the `nvidia.com/gpu` field under resources in the output. It should show the number of GPUs available for each node.
If you do not see the `nvidia.com/gpu` field, your cluster likely does not have the Nvidia GPU operator installed.
Please follow the instructions in Step 2 - Set up GPU support to install the Nvidia GPU operator.
Note that GPU operator installation can take several minutes, and you may see 0 capacity for `nvidia.com/gpu` resources until the installation is complete.
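To watch the installation progress, you can check the operator's pods directly. A sketch, assuming the operator was installed into the commonly used `gpu-operator` namespace:

```shell
# Watch the GPU operator pods come up (the namespace is an assumption;
# adjust -n if you installed the operator elsewhere)
kubectl get pods -n gpu-operator

# Once all pods are Running or Completed, re-check GPU capacity per node
kubectl get nodes -o json | \
  jq '.items[] | {name: .metadata.name, gpus: .status.capacity["nvidia.com/gpu"]}'
```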
Tip
If you are using GKE, refer to Deploying on Google Cloud GKE to install the appropriate drivers.
Step B1 - Can you run a GPU pod?#
Verify that the GPU operator is installed and the `nvidia` runtime is set as the default by running:
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
# Verify that the pod is running by checking the status of the pod
$ kubectl get pod skygputest
$ kubectl logs skygputest
# Should print the nvidia-smi output to the console
# Once you have verified that the pod is running, you can delete it
$ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/gpu_test_pod.yaml
If the pod status is pending, your nodes may not be exposing the `nvidia.com/gpu` resources checked in the previous step. You can debug further by running `kubectl describe pod skygputest`.
If the logs show `nvidia-smi: command not found`, the `nvidia` runtime is likely not set as the default. Please install the Nvidia GPU operator and make sure the `nvidia` runtime is set as the default.
For example, for RKE2, refer to the instructions on Nvidia GPU Operator installation with Helm on RKE2 to set the `nvidia` runtime as the default.
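To see which runtimes your cluster knows about, and whether containerd on a GPU node defaults to `nvidia`, you can inspect the RuntimeClasses and the node's containerd config. A sketch; the file paths below are typical defaults and may differ on your distribution:

```shell
# List RuntimeClasses known to the cluster; the GPU operator
# typically creates one named `nvidia`
kubectl get runtimeclass

# On a GPU node, check containerd's default runtime (path is an
# assumption; e.g. RKE2 keeps its containerd config under
# /var/lib/rancher/rke2/agent/etc/containerd/ instead)
grep -A1 'default_runtime_name' /etc/containerd/config.toml
```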
Step B2 - Are your nodes labeled correctly?#
SkyPilot requires nodes to be labeled with the correct GPU type to run GPU tasks. Run `kubectl get nodes -o json` to verify that your nodes are labeled correctly.
Tip
If you are using GKE, your nodes should be automatically labeled with `cloud.google.com/gke-accelerator`. You can skip this step.
$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, labels: .metadata.labels}'
# Look for the `skypilot.co/accelerator` label in the output. It should show the GPU type for each node.
If you do not see the `skypilot.co/accelerator` label, your nodes are not labeled correctly. Please follow the instructions in Step 2 - Set up GPU support to label your nodes.
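For reference, labeling a node manually looks like the following. The node name and GPU type here are hypothetical; SkyPilot expects the label value to be the lowercase GPU name (e.g. `a100`, `v100`, `t4`):

```shell
# Label a node with its GPU type (replace the node name and GPU
# type with your own values)
kubectl label nodes my-gpu-node skypilot.co/accelerator=a100

# Verify the label was applied
kubectl get nodes -l skypilot.co/accelerator --show-labels
```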
Step B3 - Can SkyPilot see your GPUs?#
Run `sky check` to verify that SkyPilot can see your GPUs.
$ sky check
# Should show `Kubernetes: Enabled` and should not print any warnings about GPU support.
# List the available GPUs in your cluster
$ sky show-gpus --cloud k8s
Step B4 - Try launching a dummy GPU task#
Next, try running a simple GPU task to verify that SkyPilot can launch GPU tasks on your cluster.
# Replace the GPU type from the sky show-gpus output in the task launch command
$ sky launch -y -c mygpucluster --cloud k8s --gpus <gpu-type>:1 -- "nvidia-smi"
# Task should run and print the nvidia-smi output to the console
# Once you have verified that the task runs, you can delete it
$ sky down -y mygpucluster
If your task does not run, check the terminal and the provisioning logs for errors. The path to the provisioning logs is printed at the start of the SkyPilot output, on the line beginning with “To view detailed progress: …”.
Verifying ports support#
If you are trying to run a task that requires ports to be opened, make sure you have followed the instructions in Exposing Services to configure SkyPilot and your cluster to use the desired method (LoadBalancer service or Nginx Ingress) for ports support.
In this section, we will verify that your cluster has ports support and that services launched by SkyPilot can be accessed.
Step C0 - Verifying LoadBalancer service setup#
If you are using LoadBalancer services for ports support, follow the below steps to verify that your cluster is configured correctly.
Tip
If you are using Nginx Ingress for ports support, skip to Step C0 - Verifying Nginx Ingress setup.
Does your cluster support LoadBalancer services?#
To verify that your cluster supports LoadBalancer services, we will create an example service and verify that it gets an external IP.
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/loadbalancer_test_svc.yaml
# Verify that the service gets an external IP
# Note: It may take some time on cloud providers to change from pending to an external IP
$ watch kubectl get svc skytest-loadbalancer
# Once you get an IP, try accessing the HTTP server by curling the external IP
$ IP=$(kubectl get svc skytest-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
$ curl $IP:8080
# Once you have verified that the service is accessible, you can delete it
$ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
$ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/loadbalancer_test_svc.yaml
If your service does not get an external IP, check the service's status with `kubectl describe svc skytest-loadbalancer`. Your cluster may not support LoadBalancer services.
Step C0 - Verifying Nginx Ingress setup#
If you are using Nginx Ingress for ports support, refer to Nginx Ingress for instructions on how to install and configure Nginx Ingress.
Tip
If you are using LoadBalancer services for ports support, you can skip this section.
Does your cluster support Nginx Ingress?#
To verify that your cluster supports Nginx Ingress, we will create an example ingress.
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
$ kubectl apply -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/ingress_test.yaml
# Get the external IP of the ingress using the externalIPs field or the loadBalancer field
$ IP=$(kubectl get service ingress-nginx-controller -n ingress-nginx -o jsonpath='{.spec.externalIPs[*]}') && [ -z "$IP" ] && IP=$(kubectl get service ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[*].ip}')
$ echo "Got IP: $IP"
$ curl http://$IP/skytest
# Once you have verified that the service is accessible, you can delete it
$ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/cpu_test_pod.yaml
$ kubectl delete -f https://raw.githubusercontent.com/skypilot-org/skypilot/master/tests/kubernetes/ingress_test.yaml
If an external IP is not acquired, check the service's status with `kubectl describe svc ingress-nginx-controller -n ingress-nginx`.
Your ingress controller's service must be of type `LoadBalancer` or `NodePort` and must have an external IP.
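You can check the controller service's type and external IP in one command. A sketch, assuming the default `ingress-nginx` installation names:

```shell
# Print the service type and, for LoadBalancer services, the external IP
kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.spec.type}{" "}{.status.loadBalancer.ingress[0].ip}{"\n"}'
```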
Is SkyPilot configured to use Nginx Ingress?#
Check your `~/.sky/config.yaml` file to verify that `ports: ingress` is configured under the `kubernetes` key.
$ cat ~/.sky/config.yaml
# Output should contain:
#
# kubernetes:
# ports: ingress
If not, add `ports: ingress` under the `kubernetes` key in your `~/.sky/config.yaml` file.
Step C1 - Verifying SkyPilot can launch services#
Next, try running a simple task with a service to verify that SkyPilot can launch services on your cluster.
$ sky launch -y -c myserver --cloud k8s --ports 8080 -- "python -m http.server 8080"
# Obtain the endpoint of the service
$ sky status --endpoint 8080 myserver
# Try curling the endpoint to verify that the service is accessible
$ curl <endpoint>
If you are unable to get the endpoint from SkyPilot, consider running `kubectl describe services` or `kubectl describe ingress` to debug it.
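Putting the last two steps together, you can capture the endpoint in a shell variable and probe it directly (assuming the `myserver` cluster launched above is still up):

```shell
# Fetch the endpoint reported by SkyPilot and curl it
ENDPOINT=$(sky status --endpoint 8080 myserver)
echo "Endpoint: $ENDPOINT"
curl "$ENDPOINT"
```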