# Monitoring Cluster-wide GPU Metrics
SkyPilot provides native integration with NVIDIA DCGM to surface real-time GPU metrics directly in the SkyPilot dashboard.

## Prerequisites
Before you begin, make sure your Kubernetes cluster meets the following requirements:

- NVIDIA GPUs are available on your worker nodes.
- The NVIDIA device plugin and the NVIDIA GPU Operator are installed.
- DCGM-Exporter is running on the cluster and exposes metrics on port `9400`. Most GPU Operator installations already deploy DCGM-Exporter for you; the check below can confirm this.
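If you are unsure whether DCGM-Exporter is already running, a quick check like the following can confirm it. The `gpu-operator` namespace and the `app=nvidia-dcgm-exporter` label below are what a typical GPU Operator install uses; adjust them to match your cluster:

```bash
# Look for DCGM-Exporter pods (namespace/label assume a standard
# GPU Operator install; adjust for your environment).
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

# Port-forward one exporter and confirm metrics are served on port 9400.
kubectl port-forward -n gpu-operator daemonset/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```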
## Set up DCGM metrics scraping
Deploy the SkyPilot API server with GPU metrics enabled:
```bash
helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
  --namespace skypilot \
  --create-namespace \
  --reuse-values \
  --set apiService.metrics.enabled=true \
  --set prometheus.enabled=true \
  --set grafana.enabled=true
```
The flags do the following:

- `apiService.metrics.enabled` – turns on the `/metrics` endpoint in the SkyPilot API server.
- `prometheus.enabled` – deploys a Prometheus instance pre-configured to scrape both the SkyPilot API server and DCGM-Exporter.
- `grafana.enabled` – deploys Grafana with an out-of-the-box dashboard that is embedded in the SkyPilot dashboard.
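As a sanity check after the upgrade, you can confirm that the components came up. Resource names vary with chart configuration, so list the pods and services in the `skypilot` namespace rather than assuming specific names:

```bash
# All API server, Prometheus, and Grafana pods should reach Running.
kubectl get pods -n skypilot

# Find the deployed services (names vary with chart configuration).
kubectl get svc -n skypilot
```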
## What metrics are exposed?
By default, the SkyPilot dashboard displays the following metrics:

- GPU utilization
- GPU memory usage
- GPU power usage
However, all metrics exported by DCGM-Exporter can be accessed via Prometheus/Grafana, including GPU errors, NVLink stats, and more.
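For example, once the bundled Prometheus is reachable, any DCGM metric can be queried through the standard Prometheus HTTP API. The `<prometheus-service>` name below is a placeholder (use `kubectl get svc -n skypilot` to find the actual service in your deployment), and note that DCGM profiling metrics such as NVLink throughput only appear if they are enabled in your DCGM-Exporter configuration:

```bash
# Port-forward the bundled Prometheus (replace the service name/port with
# the ones from your deployment).
kubectl port-forward -n skypilot svc/<prometheus-service> 9090:9090 &

# Average GPU utilization per GPU, straight from DCGM-Exporter.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)'

# XID errors reported by the driver.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_XID_ERRORS'
```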