Monitoring Cluster-wide GPU Metrics#

SkyPilot provides native integration with NVIDIA DCGM to surface real-time GPU metrics directly in the SkyPilot dashboard.

GPU metrics dashboard

Prerequisites#

Before you begin, make sure your Kubernetes cluster meets the following requirements:

  • NVIDIA GPUs are available on your worker nodes.

  • The NVIDIA device plugin or the NVIDIA GPU Operator is installed.

  • DCGM-Exporter is running on the cluster and exposes metrics on port 9400. Most GPU Operator installations already deploy DCGM-Exporter for you.

  • Node Exporter is running on the cluster and exposes metrics on port 9100. This is required only if you want to monitor the CPU and Memory metrics.

Check the dcgm exporter setup#

Verify that Prometheus scrape annotations are correctly configured for DCGM-Exporter.

  1. Check whether the DCGM-Exporter Pod has the required Prometheus scrape annotations:

kubectl get pod $POD_NAME --namespace $NAMESPACE -o jsonpath='{.metadata.annotations}'

where $POD_NAME is the DCGM-Exporter Pod name and $NAMESPACE is its namespace. For CoreWeave managed Kubernetes clusters, the namespace is cw-exporters.

Confirm the following annotations exist:

prometheus.io/scrape: "true"
prometheus.io/port: "9400"
prometheus.io/path: "/metrics"

If the Pod already has these annotations, skip the rest of this section.

  1. If not, check whether a Service for DCGM-Exporter exists:

kubectl get svc -n $NAMESPACE | grep "dcgm-exporter"

where $NAMESPACE is the DCGM-Exporter namespace.

2.1 If the Service exists, verify its Prometheus scrape annotations:

kubectl get svc $SERVICE_NAME -n $NAMESPACE -o jsonpath='{.metadata.annotations}'

Confirm the following annotations exist:

prometheus.io/scrape: "true"
prometheus.io/port: "9400"
prometheus.io/path: "/metrics"

If any are missing, edit the Service to add them.

kubectl edit svc $SERVICE_NAME -n $NAMESPACE

2.2 If the Service does not exist, create it:

kubectl create -f https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/metrics/dcgm_service.yaml -n $NAMESPACE

where $NAMESPACE is the DCGM-Exporter namespace.

Check the node exporter setup#

Verify that Prometheus scrape annotations are correctly configured for Node Exporter.

  1. Check whether the Node Exporter Pod has the required Prometheus scrape annotations:

kubectl get pod $POD_NAME --namespace $NAMESPACE -o jsonpath='{.metadata.annotations}'

where $POD_NAME is the Node Exporter Pod name and $NAMESPACE is its namespace. For CoreWeave managed Kubernetes clusters, the namespace is cw-exporters.

Confirm the following annotations exist:

prometheus.io/scrape: "true"
prometheus.io/port: "9100"
prometheus.io/path: "/metrics"

If the Pod already has these annotations, skip the rest of this section.

  1. If not, check whether a Service for Node Exporter exists:

kubectl get svc -n $NAMESPACE | grep "node-exporter"

where $NAMESPACE is the Node Exporter namespace.

2.1 If the Service exists, verify its Prometheus scrape annotations:

kubectl get svc $SERVICE_NAME -n $NAMESPACE -o jsonpath='{.metadata.annotations}'

Confirm the following annotations exist:

prometheus.io/scrape: "true"
prometheus.io/port: "9100"
prometheus.io/path: "/metrics"

If any are missing, edit the Service to add them.

kubectl edit svc $SERVICE_NAME -n $NAMESPACE

2.2 If the Service does not exist, create it:

kubectl create -f https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/metrics/node_exporter_service.yaml -n $NAMESPACE

where $NAMESPACE is the Node Exporter namespace.

Prometheus setup#

In the cluster where you deploy the API server, Prometheus is installed automatically as part of Set up DCGM metrics scraping.

For other Kubernetes clusters (external clusters), deploy Prometheus manually. SkyPilot also requires a Service skypilot-prometheus-server in the skypilot namespace to scrape metrics from external clusters.

If you use the Prometheus operator, e.g., the kube-prometheus-stack, install it in the skypilot namespace, then create the skypilot-prometheus-server Service in the same namespace.

kubectl create -f https://raw.githubusercontent.com/skypilot-org/skypilot/refs/heads/master/examples/metrics/skypilot_prometheus_server_service.yaml -n skypilot

Alternatively, install the SkyPilot Prometheus server chart; it will create the skypilot-prometheus-server Service automatically:

helm upgrade --install skypilot skypilot/skypilot-prometheus-server --devel \
 --namespace skypilot \
 --create-namespace

If you are using the Nebius Kubernetes cluster, refer to Setup GPU metrics in Nebius Kubernetes cluster for how to setup the GPU metrics.

Set up DCGM metrics scraping#

Deploy the SkyPilot API server with GPU metrics enabled:

helm upgrade --install skypilot skypilot/skypilot-nightly --devel \
  --namespace skypilot \
  --create-namespace \
  --reuse-values \
  --set apiService.metrics.enabled=true \
  --set prometheus.enabled=true \
  --set grafana.enabled=true

The flags do the following:

  • apiService.metrics.enabled – turn on the /metrics endpoint in the SkyPilot API server.

  • prometheus.enabled – deploy a Prometheus instance pre-configured to scrape both the SkyPilot API server and DCGM-Exporter.

  • grafana.enabled – deploy Grafana with an out-of-the-box dashboard that will be embedded in the SkyPilot dashboard.

What metrics are exposed?#

By default, the SkyPilot dashboard exposes the following metrics:

  • GPU utilization

  • GPU memory usage

  • GPU power usage

  • GPU temperature

  • CPU utilization

  • Memory usage

However, all metrics exported by DCGM exporter can be accessed via Prometheus/Grafana including GPU errors, NVLink stats and more.