Monitoring Cluster-wide GPU Metrics

SkyPilot integrates natively with NVIDIA DCGM to surface real-time GPU metrics directly in the SkyPilot dashboard.

Prerequisites

Before you begin, make sure your Kubernetes cluster meets the following requirements:

NVIDIA GPUs are available on your worker nodes.
The NVIDIA device plugin and the NVIDIA GPU Operator are installed.
DCGM-Exporter is running on the cluster and exposes metrics on port 9400. Most GPU Operator installations already deploy DCGM-Exporter for you.
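
One way to confirm the last prerequisite is to look for the exporter pods and then inspect what the endpoint returns. The sketch below assumes the exporter pods carry the label app=nvidia-dcgm-exporter, which GPU Operator installations typically use; a standalone DCGM-Exporter install may need a different selector. Both helper names are made up for this example.

```shell
# Assumption: exporter pods are labelled app=nvidia-dcgm-exporter
# (typical for GPU Operator installs); adjust the selector if yours differ.
list_dcgm_exporter_pods() {
  kubectl get pods --all-namespaces -l app=nvidia-dcgm-exporter -o wide
}

# Read Prometheus text-format metrics on stdin and print only the sample
# lines for the metric named in $1, skipping "# HELP" and "# TYPE" lines.
# "|| true" keeps the exit status zero when nothing matches.
dcgm_metric_samples() {
  grep -E "^${1}(\{| )" || true
}
```

With port 9400 of an exporter pod forwarded to your machine (for example via kubectl port-forward), `curl -s localhost:9400/metrics | dcgm_metric_samples DCGM_FI_DEV_GPU_UTIL` should print one sample line per GPU visible to that exporter.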

Set up DCGM metrics scraping

Deploy the SkyPilot API server with GPU metrics enabled:

helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
  --namespace skypilot \
  --create-namespace \
  --reuse-values \
  --set apiService.metrics.enabled=true \
  --set prometheus.enabled=true \
  --set grafana.enabled=true
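
Once Helm returns, it can help to check that the pods in the release namespace actually started. This is a minimal sketch, assuming the skypilot namespace from the command above; pods_not_ready is a hypothetical helper, not part of SkyPilot or kubectl.

```shell
# Read `kubectl get pods --no-headers` output (single namespace) on stdin.
# Columns are NAME READY STATUS RESTARTS AGE, so STATUS is field 3.
# Print the name and status of every pod that is neither Running nor Completed.
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Running `kubectl get pods --namespace skypilot --no-headers | pods_not_ready` prints nothing when every pod has come up.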

The flags do the following:

apiService.metrics.enabled – turns on the /metrics endpoint in the SkyPilot API server.
prometheus.enabled – deploys a Prometheus instance pre-configured to scrape both the SkyPilot API server and DCGM-Exporter.
grafana.enabled – deploys Grafana with an out-of-the-box dashboard that is embedded in the SkyPilot dashboard.
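
The same three settings can also be kept in a Helm values file instead of being passed as --set flags, which makes them easier to review and version. The sketch below writes such a file; the function name and the default file name gpu-metrics-values.yaml are arbitrary choices for this example.

```shell
# Write the three settings from the --set flags into a Helm values file.
# $1 is the output path; it defaults to gpu-metrics-values.yaml (an
# arbitrary name chosen for this example).
write_gpu_metrics_values() {
  cat > "${1:-gpu-metrics-values.yaml}" <<'EOF'
apiService:
  metrics:
    enabled: true
prometheus:
  enabled: true
grafana:
  enabled: true
EOF
}
```

After calling write_gpu_metrics_values, the file can be passed to the same helm upgrade command with `-f gpu-metrics-values.yaml` in place of the three --set flags.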

What metrics are exposed?

By default, the SkyPilot dashboard exposes the following metrics:

GPU utilization
GPU memory usage
GPU power usage

All other metrics exported by DCGM-Exporter, including GPU errors and NVLink statistics, remain accessible through Prometheus and Grafana.
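
Metrics beyond the dashboard defaults can be pulled through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at a URL you supply (for instance through kubectl port-forward; the service name depends on the chart and is not shown here) and that your DCGM-Exporter configuration exports the metric you ask for. prom_query is a hypothetical helper.

```shell
# Run an instant PromQL query against a Prometheus server.
# $1: base URL of Prometheus (assumption: you have made it reachable,
#     e.g. http://localhost:9090 after a port-forward)
# $2: PromQL expression
# -G sends the URL-encoded data as a query string on a GET request.
prom_query() {
  curl -sG "${1}/api/v1/query" --data-urlencode "query=${2}"
}
```

For example, `prom_query http://localhost:9090 'DCGM_FI_DEV_XID_ERRORS'` returns a JSON document with the latest XID error value per GPU, provided that field is enabled in the exporter's metrics list.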