Migrating from Slurm to SkyPilot#

This guide helps users familiar with Slurm transition to SkyPilot. It covers command mappings, environment variables, script porting, and common workflow patterns.

Why use SkyPilot instead of Slurm?#

  • Multi-cluster made easy: With multiple Slurm clusters, users must manually track resource availability and use different login nodes for managing jobs. SkyPilot provides a single interface across multiple Slurm clusters, Kubernetes clusters, and cloud VMs.

  • Elasticity: Slurm clusters are fixed pools. SkyPilot running on the cloud(s) can burst to additional capacity when needed and scale down when idle.

  • Stronger isolation: Without cgroups, Slurm cannot enforce resource limits, so a runaway job can crash other jobs on the same node. SkyPilot provides stronger container-based isolation.

  • Dependency management: All Slurm jobs run in the same shared environment, so managing different dependencies per job can be tricky. SkyPilot provides full isolation for each job’s environment.

  • Unified dashboard: SkyPilot provides a web dashboard for job management, logs, and monitoring across all infrastructure.

Slurm to SkyPilot#

Most Slurm concepts map directly to SkyPilot concepts.

| Slurm | SkyPilot | Notes |
|---|---|---|
| salloc --gpus=8 | sky launch -c dev --gpus H100:8 then ssh dev | Interactive allocation (called a “cluster” in SkyPilot) |
| salloc + srun | sky launch -c dev task.yaml | Allocate then run commands |
| srun <command> | sky exec <cluster> <command> | Run command on existing allocation/cluster |
| squeue | sky status | View running clusters and jobs |
| exit (from salloc) or scancel <alloc_id> | sky down <cluster> | Terminate cluster/release allocation |
| sbatch script.sh | sky jobs launch task.yaml | Submit a batch job |
| scancel <jobid> | sky jobs cancel <job_id> | Cancel a job |
| sacct | sky jobs queue | View job history |
| sinfo | sky show-gpus | View available resources |
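
As a quick sketch, a typical interactive Slurm session maps to the following commands (the cluster name dev and the training command are placeholders):

# salloc --gpus=8: allocate an 8x H100 cluster named "dev"
sky launch -c dev --gpus H100:8

# srun python train.py: run a command on the existing cluster
sky exec dev python train.py

# squeue / sacct: check clusters and jobs
sky status
sky jobs queue

# exit / scancel: release the allocation
sky down dev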

SkyPilot also provides features not available in Slurm:

| Feature | Description |
|---|---|
| sky serve | Model serving with autoscaling and load balancing |
| sky dashboard | Web UI for clusters, jobs, logs, and monitoring |
| sky api login | SSO authentication (Okta, Google Workspace, etc.) |
| sky volumes | Managed persistent volumes for data and checkpoints |
| Auto-failover | Automatic failover across clusters/clouds for higher GPU capacity |
| Object store mounting | Mount S3/GCS buckets directly to your jobs |

Login node#

Slurm clusters have login nodes for submitting jobs and accessing shared storage. With SkyPilot:

  • No login node required: Run sky launch directly from your laptop.

  • For interactive work: SSH into your cluster after launching (ssh mycluster).

  • For batch workflows: Use managed jobs (sky jobs launch) which don’t require a persistent cluster.
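
For example, all of the following can be run directly from a laptop with SkyPilot installed (mycluster and task.yaml are placeholder names):

# Interactive work: launch a GPU cluster, then SSH in
sky launch -c mycluster --gpus H100:1
ssh mycluster

# Batch work: submit a managed job; no persistent cluster needed
sky jobs launch task.yaml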

Environment variable mapping#

SkyPilot exposes environment variables similar to Slurm for distributed jobs. See SkyPilot environment variables for full details.

| Slurm | SkyPilot | Notes |
|---|---|---|
| $SLURM_JOB_NODELIST | $SKYPILOT_NODE_IPS | Newline-separated list of node IPs |
| $SLURM_NNODES | $SKYPILOT_NUM_NODES | Total number of nodes |
| $SLURM_NODEID / $SLURM_PROCID | $SKYPILOT_NODE_RANK | Node rank (0 to N-1) |
| $SLURM_GPUS_PER_NODE | $SKYPILOT_NUM_GPUS_PER_NODE | Number of GPUs per node |
| $SLURM_JOB_ID | $SKYPILOT_TASK_ID | Unique job identifier |

Example usage in a distributed training script:

num_nodes: 2

resources:
  accelerators: H100:8

run: |
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    echo "I am the head node at $HEAD_IP"
  else
    echo "I am worker $SKYPILOT_NODE_RANK, connecting to $HEAD_IP"
  fi

Porting sbatch scripts to SkyPilot YAML#

Here’s a side-by-side comparison of a typical Slurm script and its SkyPilot equivalent:

Slurm sbatch script

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --partition=gpu

module load cuda/12.1
source ~/venv/bin/activate

srun python train.py --epochs 100

SkyPilot YAML

name: train

num_nodes: 2

resources:
  accelerators: H100:8
  cpus: 32+
  memory: 256+
  image_id: docker:nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

setup: pip install torch transformers

run: python train.py --epochs 100

Key differences:

  • No module system: Use setup: for environment configuration (pip, conda) or Docker images

  • Time limits are optional: SkyPilot uses autostop for auto-termination, which can be configured to terminate on idleness or wall-clock time (see the example after this list).

  • Simpler syntax: Resource requirements are declarative YAML fields

  • Native container support: Easily use containers by setting image_id.
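
As a sketch of the autostop point above, idle-based auto-termination can be set when launching or on an existing cluster (dev is a placeholder cluster name; the flags assume a recent SkyPilot CLI):

# Stop the cluster after 60 idle minutes
sky launch -c dev --gpus H100:8 -i 60

# Or fully terminate it when idle, instead of stopping it
sky autostop dev -i 60 --down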

Resource requests#

| Slurm | SkyPilot | Notes |
|---|---|---|
| --mem=64G | memory: 64+ | Minimum memory in GB |
| --cpus-per-task=4 | cpus: 4+ | Minimum vCPUs |
| --gpus-per-node=8 | accelerators: H100:8 | GPU type and count |
| --time=24:00:00 | autostop: 60m | Idle-based timeout |

Example with resource constraints:

resources:
  accelerators: A100:4
  cpus: 16+
  memory: 128+
  disk_size: 500  # GB
  autostop:
    idle_minutes: 30

Interactive jobs#

Slurm’s salloc provides an interactive allocation. In SkyPilot, launch a cluster without a run command and SSH into it:

# Launch a cluster with GPUs
sky launch -c dev --gpus H100:8

# SSH into the cluster
ssh dev

# Or use VSCode Remote-SSH
code --remote ssh-remote+dev /path/to/code

For multi-node interactive clusters:

# Launch 4-node cluster
sky launch -c dev --gpus H100:8 --num-nodes 4

# SSH to head node
ssh dev

# SSH to worker nodes
ssh dev-worker1
ssh dev-worker2
ssh dev-worker3

When done, terminate with sky down dev or let autostop clean up idle clusters.

Job logs#

Slurm writes job output to slurm-<jobid>.out. SkyPilot provides several ways to access logs:

For clusters (sky launch):

sky logs mycluster           # Stream logs in real-time
sky logs mycluster 2         # View logs for job ID 2 on cluster

For managed jobs (sky jobs launch):

sky jobs logs <job_id>       # Stream logs for a managed job

Logs location on the cluster:

Logs are stored at ~/sky_logs/ on the cluster, organized by task ID.

Dashboard:

The SkyPilot dashboard provides a web UI to view all logs across clusters and jobs.
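
In recent SkyPilot versions, the dashboard can be opened directly from the CLI:

sky dashboard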

Job arrays and parameter sweeps#

Slurm job arrays (sbatch --array=1-100) allow running many similar jobs with different parameters.

In SkyPilot, use managed jobs with environment variables:

# Launch 100 jobs with different TASK_ID values
for i in $(seq 1 100); do
  sky jobs launch --env TASK_ID=$i -y -d task.yaml
done

Your task YAML can use TASK_ID to vary behavior:

envs:
  TASK_ID: null  # Required, passed via --env

run: |
  echo "Running task $TASK_ID"
  python train.py --seed $TASK_ID

For hyperparameter sweeps, you can also pass multiple environment variables:

for lr in 0.001 0.01 0.1; do
  for batch in 32 64 128; do
    sky jobs launch --env LR=$lr --env BATCH=$batch -y -d task.yaml
  done
done

Module system alternative#

Slurm clusters often use environment modules (module load cuda). With SkyPilot, you have several alternatives:

Use setup commands:

setup: |
  pip install torch==2.1.0
  pip install -r requirements.txt

Use Docker images:

resources:
  image_id: docker:pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

run: |
  python train.py

Use conda environments:

setup: |
  conda create -n myenv python=3.10 -y
  conda activate myenv
  conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -y

run: |
  conda activate myenv
  python train.py

Identity and authentication#

Slurm tracks users by their Unix username. SkyPilot uses SSO authentication (Okta, Google Workspace, Microsoft Entra ID) with the SkyPilot API server. User identity is tied to their SSO email, providing:

  • Ownership tracking for clusters and jobs

  • Audit logs of who launched what

  • Role-based access control (RBAC)
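
A minimal sketch of logging in to a centrally deployed API server (the endpoint URL below is a placeholder):

# Authenticate against your organization's SkyPilot API server
sky api login -e https://skypilot.example.com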

Migrating to SkyPilot on Kubernetes#

SkyPilot runs on multiple backends including Kubernetes, cloud VMs, and even Slurm itself. If you’re migrating from Slurm to use SkyPilot on Kubernetes, the following sections cover K8s-specific considerations.

Shared storage on Kubernetes#

Important

Unlike Slurm clusters, which typically have a shared NFS home directory mounted on all nodes, Kubernetes does not provide a shared home directory across pods by default.

Here are the recommended approaches for shared storage on Kubernetes:

Option 1: Mount NFS via pod_config

If your Kubernetes cluster has access to an NFS server (e.g., already mounted on nodes), mount it to your pods:

resources:
  infra: kubernetes

run: |
  ls -la /mnt/shared
  python train.py --data /mnt/shared/datasets

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - volumeMounts:
              - mountPath: /mnt/shared
                name: nfs-volume
        volumes:
          - name: nfs-volume
            nfs:
              server: nfs.example.com
              path: /exports/shared

To apply this globally to all jobs, add the config section to your ~/.sky/config.yaml.
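
A sketch of that global ~/.sky/config.yaml entry, assuming the same NFS server and export path as above:

# ~/.sky/config.yaml
kubernetes:
  pod_config:
    spec:
      containers:
        - volumeMounts:
            - mountPath: /mnt/shared
              name: nfs-volume
      volumes:
        - name: nfs-volume
          nfs:
            server: nfs.example.com
            path: /exports/shared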

Option 2: SkyPilot Volumes (PVCs)

Create a shared volume using SkyPilot’s volumes feature:

# Create volume
sky volumes apply -f - <<EOF
name: shared-data
type: k8s-pvc
infra: kubernetes
size: 100Gi
config:
  access_mode: ReadWriteMany
EOF

# Mount in your task
cat > task.yaml <<EOF
volumes:
  /mnt/data: shared-data

run: |
  ls /mnt/data
EOF

sky launch task.yaml

Option 3: Cloud Buckets

Use cloud object storage for data that doesn’t require POSIX semantics:

file_mounts:
  /data:
    source: s3://my-bucket/datasets
    mode: MOUNT

run: |
  python train.py --data /data

Option 4: Sync Code with workdir

For syncing your local code to the cluster, use workdir:

workdir: ./my-project  # Local directory or git repository URL

run: |
  # Code is synced to ~/sky_workdir/
  python train.py

Partitions and queues on Kubernetes#

Slurm uses partitions (--partition=gpu) to direct jobs to specific resources. In SkyPilot on Kubernetes, you can target specific Kubernetes contexts or namespaces.

Via CLI:

sky launch --infra kubernetes/my-gpu-context task.yaml

Via YAML:

resources:
  infra: kubernetes/gpu-context

Using multiple contexts:

Configure allowed contexts in ~/.sky/config.yaml:

kubernetes:
  allowed_contexts:
    - cpu-context
    - gpu-context
    - high-memory-context

Then SkyPilot’s optimizer will choose the best context based on your resource requirements.
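
For example, launching with only --infra kubernetes (no context pinned) lets the optimizer pick among the allowed contexts that can satisfy the request:

# Any allowed context with available H100s may be selected
sky launch --infra kubernetes --gpus H100:8 task.yaml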

Priorities and quotas on Kubernetes#

For advanced scheduling similar to Slurm’s fair-share and priority systems:

  • Priority classes: Use Kubernetes priority classes for job preemption

  • Kueue integration: SkyPilot supports Kueue for advanced queuing, quotas, and preemption

These features allow cluster admins to implement fair-share policies, user quotas, and priority-based scheduling similar to Slurm.
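
As a sketch, both mechanisms can be wired up through pod_config in ~/.sky/config.yaml; the priority class and Kueue local queue names below are placeholders that must already exist in your cluster:

# ~/.sky/config.yaml
kubernetes:
  pod_config:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: user-queue   # submit pods to a Kueue LocalQueue
    spec:
      priorityClassName: high-priority          # Kubernetes priority class used for preemption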

Further reading#