Migrating from Slurm to SkyPilot#
This guide helps users familiar with Slurm transition to SkyPilot. It covers command mappings, environment variables, script porting, and common workflow patterns.
Why use SkyPilot instead of Slurm?#
Multi-cluster made easy: With multiple Slurm clusters, users must manually track resource availability and use different login nodes for managing jobs. SkyPilot provides a single interface across multiple Slurm clusters, Kubernetes clusters, and cloud VMs.
Elasticity: Slurm clusters are fixed pools. SkyPilot running on the cloud(s) can burst to additional capacity when needed and scale down when idle.
Stronger isolation: Unless cgroups are configured, Slurm does not enforce resource limits, so a runaway job can starve or crash others. SkyPilot provides stronger container-based isolation.
Dependency management: Slurm jobs share the cluster’s environment, so managing different dependencies per job can be tricky. SkyPilot fully isolates each job’s environment.
Unified dashboard: SkyPilot provides a web dashboard for job management, logs, and monitoring across all infrastructure.
Slurm to SkyPilot#
Most Slurm concepts map directly to SkyPilot concepts.
| Slurm | SkyPilot | Notes |
|---|---|---|
| `salloc` | `sky launch` | Interactive allocation (called a “cluster” in SkyPilot) |
| `srun` | `sky launch` | Allocate then run commands |
| `srun --jobid=<id>` | `sky exec` | Run command on existing allocation/cluster |
| `squeue` | `sky status` / `sky queue` | View running clusters and jobs |
| `scancel` | `sky down` | Terminate cluster/release allocation |
| `sbatch` | `sky jobs launch` | Submit a batch job |
| `scancel` | `sky jobs cancel` | Cancel a job |
| `sacct` | `sky jobs queue` | View job history |
| `sinfo` | `sky show-gpus` | View available resources |
SkyPilot also provides features not available in Slurm:
| Feature | Description |
|---|---|
| SkyServe | Model serving with autoscaling and load balancing |
| Dashboard | Web UI for clusters, jobs, logs, and monitoring |
| Authentication | SSO authentication (Okta, Google Workspace, etc.) |
| Volumes | Managed persistent volumes for data and checkpoints |
| Auto-failover | Automatic failover across clusters/clouds for higher GPU capacity |
| Object store mounting | Mount S3/GCS buckets directly to your jobs |
Login node#
Slurm clusters have login nodes for submitting jobs and accessing shared storage. With SkyPilot:
- No login node required: Run `sky launch` directly from your laptop.
- For interactive work: SSH into your cluster after launching (`ssh mycluster`).
- For batch workflows: Use managed jobs (`sky jobs launch`), which don’t require a persistent cluster.
Environment variable mapping#
SkyPilot exposes environment variables similar to Slurm for distributed jobs. See SkyPilot environment variables for full details.
| Slurm | SkyPilot | Notes |
|---|---|---|
| `SLURM_JOB_NODELIST` | `SKYPILOT_NODE_IPS` | Newline-separated list of node IPs |
| `SLURM_NNODES` | `SKYPILOT_NUM_NODES` | Total number of nodes |
| `SLURM_NODEID` | `SKYPILOT_NODE_RANK` | Node rank (0 to N-1) |
| `SLURM_GPUS_ON_NODE` | `SKYPILOT_NUM_GPUS_PER_NODE` | Number of GPUs per node |
| `SLURM_JOB_ID` | `SKYPILOT_TASK_ID` | Unique job identifier |
Example usage in a distributed training script:
num_nodes: 2

resources:
  accelerators: H100:8

run: |
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    echo "I am the head node at $HEAD_IP"
  else
    echo "I am worker $SKYPILOT_NODE_RANK, connecting to $HEAD_IP"
  fi
Porting sbatch scripts to SkyPilot YAML#
Here’s a side-by-side comparison of a typical Slurm script and its SkyPilot equivalent:
Slurm sbatch script
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --partition=gpu
module load cuda/12.1
source ~/venv/bin/activate
srun python train.py --epochs 100
SkyPilot YAML
name: train
num_nodes: 2

resources:
  accelerators: H100:8
  cpus: 32+
  memory: 256+
  image_id: docker:nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

setup: pip install torch transformers

run: python train.py --epochs 100
Key differences:
- No module system: Use `setup:` for environment configuration (pip, conda) or Docker images.
- Time limits are optional: Instead of `--time`, SkyPilot uses autostop, which can be configured to terminate clusters on idleness or after a set duration.
- Simpler syntax: Resource requirements are declarative YAML fields.
- Native container support: Easily use containers by setting `image_id`.
Resource requests#
| Slurm | SkyPilot | Notes |
|---|---|---|
| `--mem=128G` | `memory: 128+` | Minimum memory in GB |
| `--cpus-per-task=16` | `cpus: 16+` | Minimum vCPUs |
| `--gres=gpu:a100:4` | `accelerators: A100:4` | GPU type and count |
| `--time=01:00:00` | autostop | Idle-based timeout |
Example with resource constraints:
resources:
  accelerators: A100:4
  cpus: 16+
  memory: 128+
  disk_size: 500  # GB

autostop:
  idle_minutes: 30
Interactive jobs#
Slurm’s salloc provides an interactive allocation. In SkyPilot, launch a cluster without a run command and SSH into it:
# Launch a cluster with GPUs
sky launch -c dev --gpus H100:8
# SSH into the cluster
ssh dev
# Or use VSCode Remote-SSH
code --remote ssh-remote+dev /path/to/code
For multi-node interactive clusters:
# Launch 4-node cluster
sky launch -c dev --gpus H100:8 --num-nodes 4
# SSH to head node
ssh dev
# SSH to worker nodes
ssh dev-worker1
ssh dev-worker2
ssh dev-worker3
When done, terminate with sky down dev or let autostop clean up idle clusters.
Job logs#
Slurm writes job output to slurm-<jobid>.out. SkyPilot provides several ways to access logs:
For clusters (`sky launch`):
sky logs mycluster # Stream logs in real-time
sky logs mycluster 2 # View logs for job ID 2 on cluster
For managed jobs (`sky jobs launch`):
sky jobs logs <job_id> # Stream logs for a managed job
Logs location on the cluster:
Logs are stored at ~/sky_logs/ on the cluster, organized by task ID.
Dashboard:
The SkyPilot dashboard provides a web UI to view all logs across clusters and jobs.
Job arrays and parameter sweeps#
Slurm job arrays (sbatch --array=1-100) allow running many similar jobs with different parameters.
In SkyPilot, use managed jobs with environment variables:
# Launch 100 jobs with different TASK_ID values
for i in $(seq 1 100); do
sky jobs launch --env TASK_ID=$i -y -d task.yaml
done
Your task YAML can use TASK_ID to vary behavior:
envs:
  TASK_ID: null  # Required, passed via --env

run: |
  echo "Running task $TASK_ID"
  python train.py --seed $TASK_ID
For hyperparameter sweeps, you can also pass multiple environment variables:
for lr in 0.001 0.01 0.1; do
  for batch in 32 64 128; do
    sky jobs launch --env LR=$lr --env BATCH=$batch -y -d task.yaml
  done
done
Module system alternative#
Slurm clusters often use environment modules (module load cuda). With SkyPilot, you have several alternatives:
Use setup commands:
setup: |
  pip install torch==2.1.0
  pip install -r requirements.txt
Use Docker images:
resources:
  image_id: docker:pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

run: |
  python train.py
Use conda environments:
setup: |
  conda create -n myenv python=3.10 -y
  conda activate myenv
  conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -y

run: |
  conda activate myenv
  python train.py
Identity and authentication#
Slurm tracks users by their Unix username. SkyPilot uses SSO authentication (Okta, Google Workspace, Microsoft Entra ID) with the SkyPilot API server. User identity is tied to their SSO email, providing:
- Mapping of cluster and job ownership
- Audit logs of who launched what
- Role-based access control (RBAC)
Migrating to SkyPilot on Kubernetes#
SkyPilot runs on multiple backends including Kubernetes, cloud VMs, and even Slurm itself. If you’re migrating from Slurm to use SkyPilot on Kubernetes, the following sections cover K8s-specific considerations.
Partitions and queues on Kubernetes#
Slurm uses partitions (--partition=gpu) to direct jobs to specific resources. In SkyPilot on Kubernetes, you can target specific Kubernetes contexts or namespaces.
Via CLI:
sky launch --infra kubernetes/my-gpu-context task.yaml
Via YAML:
resources:
infra: kubernetes/gpu-context
Using multiple contexts:
Configure allowed contexts in ~/.sky/config.yaml:
kubernetes:
  allowed_contexts:
    - cpu-context
    - gpu-context
    - high-memory-context
Then SkyPilot’s optimizer will choose the best context based on your resource requirements.
Priorities and quotas on Kubernetes#
For advanced scheduling similar to Slurm’s fair-share and priority systems:
- Priority classes: Use Kubernetes priority classes for job preemption.
- Kueue integration: SkyPilot supports Kueue for advanced queuing, quotas, and preemption.
These features allow cluster admins to implement fair-share policies, user quotas, and priority-based scheduling similar to Slurm.
Further reading#
- Quickstart: Get started with SkyPilot
- Interactive development: Develop on your laptop and run on the cloud
- Distributed jobs: Multi-node training guide
- Managed jobs: Fault-tolerant batch jobs