Monitoring SkyPilot API Server Metrics#

SkyPilot API Server can export Prometheus-compatible metrics and optionally deploy a one-click Prometheus + Grafana stack so that you get a fully functional monitoring solution out of the box.

Tip

Metrics are disabled by default. All the knobs described below can be set via helm upgrade during the initial installation or a later upgrade.

Quickstart: enable the full metrics stack#

If you do not already have Prometheus or Grafana running, the quickest way to get started is to let the SkyPilot Helm chart deploy everything for you with a single command:

helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
  --namespace skypilot \
  --create-namespace \
  --reuse-values \
  --set apiService.metrics.enabled=true \
  --set prometheus.enabled=true \
  --set grafana.enabled=true

You can access Grafana at the /grafana endpoint:

# Fetch the endpoint URL
HOST=$(kubectl get svc ${RELEASE_NAME}-ingress-nginx-controller --namespace $NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo http://$HOST/grafana

Metrics exposed#

The endpoint /grafana on the SkyPilot API server exposes the following metrics in standard Prometheus format:

API Server uptime
Requests per second grouped by HTTP status code
Request duration grouped by percentile
Requests per second grouped by endpoint path

You can also setup GPU metric collection to directly export GPU memory, utilization and power consumption.

Using existing Prometheus / Grafana#

The Helm chart introduces three new top-level blocks to provide flexibility in how you set up Prometheus and Grafana:

apiService.metrics.enabled – enables the /metrics HTTP endpoint on the SkyPilot API server.
prometheus.enabled – deploys a prometheus instance configured to scrape the /metrics endpoint on the SkyPilot API server.
grafana.enabled – deploys Grafana with a pre-baked dashboard to display the SkyPilot API server metrics from prometheus.

All three default to false so you can mix & match:

Fully managed Prometheus + Grafana – set apiService.metrics.enabled: true, prometheus.enabled: true, and grafana.enabled: true. The chart will deploy a fully managed Prometheus + Grafana stack.
External Prometheus / Grafana – set only apiService.metrics.enabled: true. The API server will expose the metrics on the /metrics endpoint and the pod will be annotated with prometheus.io/scrape: true to enable automatic scraping by prometheus.
External Grafana, internal Prometheus – enable prometheus but disable grafana. Point your existing Grafana at the Prometheus service created by the chart.