Upgrades and High Availability#

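This page covers how to keep a remote SkyPilot API server resilient and up to date: configuring high availability, upgrading Helm and VM deployments, graceful upgrades, upgrade strategies, and API compatibility.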

High availability#

The SkyPilot API server can be configured for high availability by making it fully stateless. Backing it with an external PostgreSQL database decouples API server state from the API server pod, so the pod can be restarted, rescheduled, or upgraded (including via rolling updates) without losing state.

Note

Multi-replica API server deployments are not supported in open-source SkyPilot.

Tip

Scaling SkyPilot beyond 20 users or 1,000 GPUs, or need multi-replica high availability? We would love to talk to you. SkyPilot has been supporting teams with 200+ users and 10,000+ GPUs with high availability and up to 10× faster performance — sign up here.

Back the API server with a persistent database#

The API server can optionally be configured with a PostgreSQL database to persist state. It can be an externally managed database.

If a persistent DB is not specified, the API server uses a Kubernetes persistent volume to persist state.

Note

Database configuration must be set in the Helm deployment.

Configure PostgreSQL during the first Helm deployment using one of the two options below.

Option 1: Set the DB connection URI in helm values

Set apiService.dbConnectionString to postgresql://<username>:<password>@<host>:<port>/<database> in the helm values:

# --reuse-values keeps the Helm chart values set in the previous step
helm upgrade --install $RELEASE_NAME skypilot/skypilot-nightly --devel \
  --namespace $NAMESPACE \
  --reuse-values \
  --set apiService.dbConnectionString=postgresql://<username>:<password>@<host>:<port>/<database>
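Passwords that contain characters such as @, : or / would break the URI structure unless they are percent-encoded. The following is a minimal sketch of assembling the connection string in POSIX shell. The helper names urlencode and pg_connection_uri are hypothetical, the encoder handles ASCII input only, and it assumes the API server's database driver decodes percent-encoding the way standard PostgreSQL URI parsers do.

```shell
# Percent-encode every character outside the RFC 3986 "unreserved" set.
# ASCII input only. The body runs in a subshell so its variables do not leak.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character
    case $c in
      [a-zA-Z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Assemble postgresql://<username>:<password>@<host>:<port>/<database>.
# Arguments: <username> <password> <host> <port> <database>
pg_connection_uri() {
  printf 'postgresql://%s:%s@%s:%s/%s\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" "$3" "$4" "$5"
}

# Usage (placeholder values):
#   DB_URI=$(pg_connection_uri skypilot 'p@ss:w/rd' db.example.com 5432 skypilot)
```

Quoting the resulting value when passing it to helm or kubectl keeps the shell from interpreting any remaining special characters.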

Option 2: Set the DB connection URI via Kubernetes secret

(available on nightly version 20250626 and later)

Create a Kubernetes secret that contains the DB connection URI:

kubectl create secret generic skypilot-db-connection-uri \
  --namespace $NAMESPACE \
  --from-literal connection_string=postgresql://<username>:<password>@<host>:<port>/<database>
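Before pointing the Helm chart at the secret, it can help to confirm that the secret really holds a connection_string key with a PostgreSQL URI. The sketch below uses kubectl's jsonpath output and base64 decoding; the function name db_secret_looks_valid is hypothetical, and the pattern check is only a shape check, not a connectivity test.

```shell
# Succeeds if the secret's connection_string key decodes to a postgresql:// URI.
# Arguments: <namespace> <secret-name>. The decoded value is never printed.
db_secret_looks_valid() {
  case $(kubectl get secret "$2" --namespace "$1" \
           -o jsonpath='{.data.connection_string}' | base64 -d) in
    postgresql://*@*/*) return 0 ;;
    *) echo "secret $2: connection_string is missing or not a postgresql:// URI" >&2
       return 1 ;;
  esac
}

# Usage:
#   db_secret_looks_valid "$NAMESPACE" skypilot-db-connection-uri
```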

When installing or upgrading the Helm chart, set apiService.dbConnectionSecretName to the secret name:

helm upgrade --install $RELEASE_NAME skypilot/skypilot-nightly --devel \
  --namespace $NAMESPACE \
  --reuse-values \
  --set apiService.dbConnectionSecretName=skypilot-db-connection-uri

You can also directly set this value in the values.yaml file, e.g.:

apiService:
  dbConnectionSecretName: skypilot-db-connection-uri

Note

Once apiService.dbConnectionString or apiService.dbConnectionSecretName is specified, no other SkyPilot configuration can be specified in the helm chart. That is, apiService.config must be null. To set any other SkyPilot configuration, see Optional: Setting the SkyPilot config.

Upgrade API server deployed with Helm#

With a Helm deployment, the SkyPilot API server can be upgraded gracefully, without causing client-side errors, by following the steps below.

Step 1: Prepare an upgrade#

  1. Find the version to use in SkyPilot nightly build.

  2. Update SkyPilot helm repository to the latest version:

helm repo update skypilot

  3. Prepare versioning environment variables. NAMESPACE and RELEASE_NAME should be set to the currently installed namespace and release:

NAMESPACE=skypilot # TODO: change to your installed namespace
RELEASE_NAME=skypilot # TODO: change to your installed release name
VERSION=1.0.0-dev20250410 # TODO: change to the version you want to upgrade to
IMAGE_REPO=berkeleyskypilot/skypilot-nightly
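A typo in VERSION only surfaces later as an image pull failure, so a quick format check before upgrading can save a round trip. This is a sketch that assumes nightly versions follow the X.Y.Z-devYYYYMMDD shape of the example above; stable release versions would not match it. The function names are hypothetical.

```shell
# Succeeds if the argument looks like a nightly version such as 1.0.0-dev20250410.
is_nightly_version() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+-dev[0-9]{8}$'
}

# Print the image reference in the <repo>:<version> form.
# Arguments: <image-repo> <version>
api_server_image() {
  is_nightly_version "$2" || { echo "unexpected version format: $2" >&2; return 1; }
  printf '%s:%s\n' "$1" "$2"
}

# Usage:
#   api_server_image "$IMAGE_REPO" "$VERSION"
```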

Step 2: Upgrade the API server and clients#

Upgrade the clients:

pip install -U skypilot-nightly==${VERSION}

Upgrade the API server:

# --reuse-values is critical to keep the values set in the previous installation steps.
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot-nightly --devel --reuse-values \
  --set apiService.image=${IMAGE_REPO}:${VERSION}

While the API server is being upgraded, the SkyPilot CLI and Python SDK automatically retry requests until the new version of the API server has started. The upgrade is therefore graceful as long as the new version of the API server does not break API compatibility. For more details, refer to Graceful upgrade.
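Scripts that call the API server through other tools do not get this retry behavior for free. The following generic retry loop is a sketch of the same idea for such scripts; it is not SkyPilot's own retry implementation, and the name retry_until_ok is hypothetical.

```shell
# Run a command until it succeeds, up to <attempts> times, sleeping <delay>
# seconds between tries. Usage: retry_until_ok <attempts> <delay> <command...>
retry_until_ok() {
  _retry_max=$1
  _retry_delay=$2
  shift 2
  _retry_n=1
  while ! "$@"; do
    if [ "$_retry_n" -ge "$_retry_max" ]; then
      echo "giving up after $_retry_n attempts: $*" >&2
      return 1
    fi
    _retry_n=$((_retry_n + 1))
    sleep "$_retry_delay"
  done
}

# Usage (assumes "sky api info" exits non-zero while the server is unavailable):
#   retry_until_ok 30 10 sky api info
```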

Optionally, you can watch the upgrade progress with:

$ kubectl get pod --namespace $NAMESPACE -l app=${RELEASE_NAME}-api --watch
NAME                                       READY   STATUS            RESTARTS   AGE
skypilot-demo-api-server-cf4896bdf-62c96   0/1     Init:0/2          0          7s
skypilot-demo-api-server-cf4896bdf-62c96   0/1     Init:1/2          0          24s
skypilot-demo-api-server-cf4896bdf-62c96   0/1     PodInitializing   0          26s
skypilot-demo-api-server-cf4896bdf-62c96   0/1     Running           0          27s
skypilot-demo-api-server-cf4896bdf-62c96   1/1     Running           0          50s

The upgraded API server is ready to serve requests once the pod status is Running and the READY column shows 1/1.
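In automation, the same readiness condition can be awaited with kubectl wait instead of watching the output by hand. This sketch reuses the app=${RELEASE_NAME}-api label selector from the command above; the function name and the 300s default timeout are assumptions. With the Recreate strategy it should be run after the new pod has been created, since kubectl wait fails if no pod matches or if the old pod it is watching is deleted.

```shell
# Block until the API server pod reports Ready, or fail after the timeout.
# Arguments: <namespace> <release-name> [timeout, default 300s]
wait_for_api_server_pod() {
  kubectl wait pod --namespace "$1" -l "app=$2-api" \
    --for=condition=Ready --timeout="${3:-300s}"
}

# Usage:
#   wait_for_api_server_pod "$NAMESPACE" "$RELEASE_NAME" 600s
```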

Note

apiService.config will be IGNORED during an upgrade. To update your SkyPilot config, see here.

Step 3: Verify the upgrade#

Verify the API server is able to serve requests and the version is consistent with the version you upgraded to:

$ sky api info
Using SkyPilot API server: <ENDPOINT>
├── Status: healthy, commit: 022a5c3ffe258f365764b03cb20fac70934f5a60, version: 1.0.0.dev20250410
└── User: aclice (abcd1234)
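The version check can also be scripted. Note that the output above spells the version as 1.0.0.dev20250410 while the VERSION variable uses 1.0.0-dev20250410, so the comparison needs to normalize the two. This is a sketch that assumes the "version: ..." field keeps the layout shown in the sample output, which may change between releases; the function names are hypothetical.

```shell
# Extract the "version: ..." field from `sky api info` output read on stdin.
server_version() {
  sed -n 's/.*version: \([^ ,]*\).*/\1/p' | head -n 1
}

# The two spellings differ only in "-dev" versus ".dev", so rewrite one form.
normalize_version() {
  printf '%s\n' "$1" | sed 's/-dev/.dev/'
}

# Succeeds if both arguments name the same version after normalization.
versions_match() {
  [ "$(normalize_version "$1")" = "$(normalize_version "$2")" ]
}

# Usage:
#   versions_match "$(sky api info | server_version)" "$VERSION"
```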

If possible, you can also trigger your pipelines that depend on the API server to verify there is no compatibility issue after the upgrade.

Upgrade the API server deployed on VM#

Note

VM deployment does not offer graceful upgrade. We recommend the Helm deployment (see Deploying SkyPilot API Server) for production environments. The following is a workaround for upgrading the SkyPilot API server in VM deployments.

Assuming the cluster name of the API server is api-server (as used in the Alternative: Deploy on cloud VMs guide), you can upgrade the API server with the following steps:

  1. Get the version to upgrade to from SkyPilot nightly build.

  2. Switch to the original API server endpoint that was used to launch the cloud VM for the API server. This is usually the local API server that was started when you ran sky launch -c api-server skypilot-api-server.yaml in the Alternative: Deploy on cloud VMs guide:

# Replace http://localhost:46580 with the real API server endpoint if you were not using the local API server to launch the API server VM instance.
sky api login -e http://localhost:46580

  3. Check that the API server VM instance is UP:

$ sky status api-server
Clusters
NAME        LAUNCHED     RESOURCES                                                                  STATUS  AUTOSTOP  COMMAND
api-server  41 mins ago  1x AWS(c6i.2xlarge, image_id={'us-east-1': 'docker:berkeleyskypilot/sk...  UP      -         sky exec api-server pip i...

  4. Upgrade the clients:

pip install -U skypilot-nightly==${VERSION}

Note

After upgrading the clients, they should not be used until the API server is upgraded to the new version.

  5. Upgrade SkyPilot on the VM and restart the API server:

Note

Upgrading and restarting the API server will interrupt all pending and running requests.

sky exec api-server "pip install -U skypilot-nightly[all] && sky api stop && sky api start --deploy"
# Alternatively, you can also upgrade to a specific version with:
sky exec api-server "pip install -U skypilot-nightly[all]==${VERSION} && sky api stop && sky api start --deploy"

  6. Switch back to the remote API server:

ENDPOINT=$(sky status --endpoint api-server)
sky api login -e $ENDPOINT

  7. Verify the API server is running and the version is consistent with the version you upgraded to:

$ sky api info
Using SkyPilot API server: <ENDPOINT>
├── Status: healthy, commit: 022a5c3ffe258f365764b03cb20fac70934f5a60, version: 1.0.0.dev20250410
└── User: aclice (abcd1234)
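The VM upgrade steps above can be collected into one function so that a failure at any point stops the sequence. This is a sketch that only chains the commands already shown; the function name is hypothetical, it pins the given version, and it prints the cluster status without parsing it, so the UP state still needs to be checked by eye.

```shell
# Upgrade a VM-deployed API server.
# Arguments: <cluster-name> <version> [local-endpoint, default http://localhost:46580]
upgrade_vm_api_server() {
  # Talk to the API server that originally launched the VM.
  sky api login -e "${3:-http://localhost:46580}" || return 1
  # Show the VM's status (not parsed here; inspect the output).
  sky status "$1" || return 1
  # Upgrade the local client.
  pip install -U "skypilot-nightly==$2" || return 1
  # Upgrade SkyPilot on the VM and restart the server.
  # This interrupts all pending and running requests.
  sky exec "$1" "pip install -U skypilot-nightly[all]==$2 && sky api stop && sky api start --deploy" || return 1
  # Switch back to the remote API server.
  sky api login -e "$(sky status --endpoint "$1")" || return 1
  # Verify the server is healthy and reports the new version.
  sky api info
}

# Usage:
#   upgrade_vm_api_server api-server "$VERSION"
```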

Graceful upgrade#

A server can be gracefully upgraded when it is deployed with Helm and the new version does not break API compatibility with the clients (see API compatibility).

Behavior when the API server is being upgraded:

  • For critical ongoing requests (e.g., launching a cluster), it waits for them to finish with a timeout.

  • For non-critical ongoing requests (e.g., log tailing), it cancels them and returns an error to ask the client to retry.

  • For new requests, it returns an error to ask the client to retry. New requests will be served when the new version of the API server is ready.

To further reduce the waiting time during an upgrade, you can use a rolling update for the API server (see Upgrade strategy).

SkyPilot Python SDK and CLI will automatically retry until the new version of API server starts, and ongoing requests (e.g., log tailing) will automatically resume.

To ensure that all the regular critical requests can complete within the timeout, you can adjust the timeout by setting apiService.terminationGracePeriodSeconds in helm values based on your workload, e.g.:

helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot-nightly --devel --reuse-values \
  --set apiService.terminationGracePeriodSeconds=300

Upgrade strategy#

By default, the API server is upgraded with the Recreate strategy, which introduces waiting time for new requests during upgrade. To eliminate the waiting time, you can upgrade the API server with the RollingUpdate strategy.

Note

RollingUpdate is an experimental feature. There is a known limitation that some running commands might fail when the old version of the API server gets removed from the ingress backend. It is recommended to schedule the upgrade during a maintenance window.

Warning

Managed jobs and local file mounts: Local file_mounts and workdir for managed jobs are stored on the pod’s ephemeral filesystem and will be lost when the old pod is replaced during a rolling update. To avoid this:

  • Enable persistent storage with a ReadWriteMany (RWX) PVC so both pods can access the files during the transition.

  • Alternatively, use cloud buckets, volumes, or git instead of local paths; or set jobs.bucket to redirect all local file uploads to a cloud bucket.

This does not apply if you are using a remote jobs controller.

The following table compares the two upgrade strategies:

Upgrade Strategy Comparison#

| Aspect | Recreate | RollingUpdate |
| --- | --- | --- |
| Availability | Brief downtime during upgrade | Zero downtime |
| Request Handling | New requests wait until upgrade completes | New requests served continuously by available replicas |
| Database Requirements | Can use local storage (SQLite) | Must use external persistent database |
| Resource Usage During Upgrade | Terminates old API server pod, then starts new one | Starts new API server pod, then terminates old one |
| Use Cases | Development environments, simple setups | Production environments requiring high availability |

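To use the RollingUpdate strategy, the API server must use an external persistent database instead of local storage, as the comparison above notes. Here is an example of deploying the API server with the RollingUpdate strategy, local storage disabled, and a database connection secret: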

helm upgrade --install -n $NAMESPACE $RELEASE_NAME skypilot/skypilot-nightly --devel --reuse-values \
  --set apiService.upgradeStrategy=RollingUpdate \
  --set storage.enabled=false \
  --set apiService.dbConnectionSecretName=my-db-secret
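The same three settings can be kept in a values file rather than repeated as --set flags. The sketch below writes such a file from the shell; the function name and the file name in the usage line are hypothetical, and the keys simply mirror the flags in the command above.

```shell
# Write a Helm values file equivalent to the three --set flags above.
# Arguments: <output-path> <db-secret-name>
write_rolling_update_values() {
  [ -n "$2" ] || { echo "a DB connection secret name is required" >&2; return 1; }
  cat > "$1" <<EOF
apiService:
  upgradeStrategy: RollingUpdate
  dbConnectionSecretName: $2
storage:
  enabled: false
EOF
}

# Usage:
#   write_rolling_update_values rolling-update-values.yaml my-db-secret
#   helm upgrade --install -n $NAMESPACE $RELEASE_NAME skypilot/skypilot-nightly \
#     --devel --reuse-values -f rolling-update-values.yaml
```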

Ingress config#

The SkyPilot helm chart automatically configures the ingress resource to achieve higher availability during upgrade. If you are managing the ingress resource outside of the SkyPilot helm chart, refer to the following snippet to improve the availability during upgrades:

Example ingress based on nginx-ingress-controller
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: your-ingress-name
  annotations:
    # Enable session affinity to route the requests of the same client to the same pod during upgrade.
    # Without session affinity, the chance that requests fail during upgrade would be higher.
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "SKYPILOT_ROUTEID"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-change-on-failure: "true"
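For an ingress that already exists, the same annotations can be applied in place with kubectl annotate. This is a sketch with a hypothetical function name; if the ingress is managed by another chart or a GitOps tool, that tool may revert the annotations, so its source manifest should be updated as well.

```shell
# Apply the session-affinity annotations to an existing ingress.
# Arguments: <namespace> <ingress-name>
annotate_ingress_for_upgrades() {
  kubectl annotate ingress "$2" --namespace "$1" --overwrite \
    nginx.ingress.kubernetes.io/affinity=cookie \
    nginx.ingress.kubernetes.io/session-cookie-name=SKYPILOT_ROUTEID \
    nginx.ingress.kubernetes.io/affinity-mode=persistent \
    nginx.ingress.kubernetes.io/session-cookie-change-on-failure=true
}

# Usage:
#   annotate_ingress_for_upgrades "$NAMESPACE" your-ingress-name
```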

API compatibility#

Starting from 0.10.0, SkyPilot guarantees API compatibility between adjacent minor versions, which makes graceful upgrades across minor versions possible.

For example, assuming 0.11.0 is released, the following table shows one possible upgrade sequence that can upgrade the API server and clients from 0.10.0 to 0.11.0 without breaking API compatibility:

Upgrade across minor versions#

| Client | Server | Compatible | Notes |
| --- | --- | --- | --- |
| 0.10.0 | 0.10.0 | Yes | Initial state |
| 0.10.0 | 0.11.0 | Yes | Upgrade the API server first |
| 0.11.0 | 0.11.0 | Yes | Gradually upgrade all clients |
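The adjacency rule can be checked before an upgrade with a small helper. This sketch reads the guarantee as "same major version, minor versions differing by at most one"; treating a major version change as outside the guarantee is an assumption, and the function names are hypothetical. It expects plain X.Y.Z version strings.

```shell
# Print the major or minor component of a version such as 0.10.0.
major_of() { printf '%s\n' "$1" | cut -d. -f1; }
minor_of() { printf '%s\n' "$1" | cut -d. -f2; }

# Succeeds when client and server share a major version and their minor
# versions differ by at most one. Arguments: <client-version> <server-version>
within_compat_window() {
  [ "$(major_of "$1")" -eq "$(major_of "$2")" ] || return 1
  _minor_gap=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$_minor_gap" -ge -1 ] && [ "$_minor_gap" -le 1 ]
}

# Usage:
#   within_compat_window 0.10.0 0.11.0 && echo "within the compatibility window"
```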

When the client and server are running on different minor versions, SkyPilot CLI will print an upgrade hint as a reminder to upgrade the client:

$ sky status
The SkyPilot API server is running in version X, which is newer than your client version Y. The compatibility for your current version might be dropped in the next server upgrade.
Consider upgrading your client with:
pip install -U skypilot==X.X.X

For a nightly build, its API compatibility is equivalent to its previous minor version, e.g., all nightly builds after 0.10.0 and before 0.11.0 have the same API compatibility guarantee as 0.10.0.