Cluster Job Queue#

SkyPilot’s job queue allows multiple jobs to be scheduled on a cluster.

Getting started#

Each task submitted by sky exec is automatically queued and scheduled for execution on an existing cluster:

# Launch the job 5 times.
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d

The -d / --detach-run flag detaches logging from the terminal (the command returns as soon as the job is submitted), which is useful for launching many long-running jobs concurrently.
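For instance, many jobs can be queued from a shell loop. The sketch below is illustrative only: it assumes the task YAML reads a hypothetical LR environment variable (passed with --env) to vary the learning rate.

# Queue three jobs without waiting for any of them to finish.
for lr in 1e-4 3e-4 1e-3; do
  sky exec mycluster task.yaml -d --env LR="$lr"
done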

To show a cluster’s jobs and their statuses:

# Show a cluster's jobs (job IDs, statuses).
sky queue mycluster

To show the output for each job:

# Stream the outputs of a job.
sky logs mycluster JOB_ID

To cancel a job:

# Cancel a job.
sky cancel mycluster JOB_ID

# Cancel all jobs on a cluster.
sky cancel mycluster --all

Multi-node jobs#

The job queue also supports jobs that run on multiple nodes.

First, create a cluster.yaml to specify the desired cluster:

num_nodes: 4
resources:
  accelerators: H100:8

workdir: ...
setup: |
  # Install dependencies.
  ...

Use sky launch -c mycluster cluster.yaml to provision a 4-node cluster, with each node having 8 H100 GPUs. The num_nodes field specifies how many nodes the cluster should have.
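For concreteness, a filled-in cluster.yaml might look like the sketch below; the workdir path and the pip command are illustrative placeholders.

num_nodes: 4
resources:
  accelerators: H100:8

workdir: ~/my_project   # local directory synced to every node
setup: |
  # Runs once per node at provisioning time.
  pip install -r requirements.txt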

Next, create a task.yaml to specify each task:

num_nodes: 2
resources:
  accelerators: H100:4

run: |
  # Run training script.
  ...

This specifies a task that needs to be run on 2 nodes, each of which must have 4 free H100s.

Use sky exec mycluster task.yaml to submit this task, which will be scheduled correctly by the job queue.
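As a sketch of what such a task's run section might contain, the example below launches a torchrun job across the 2 allocated nodes. The train.py entrypoint and the port are illustrative; SkyPilot exposes each node's rank and the list of node IPs through the SKYPILOT_NODE_RANK and SKYPILOT_NODE_IPS environment variables.

num_nodes: 2
resources:
  accelerators: H100:4

run: |
  # Use the first IP in SKYPILOT_NODE_IPS as the rendezvous (master) address.
  HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun --nnodes=2 --nproc_per_node=4 \
    --node_rank="$SKYPILOT_NODE_RANK" \
    --master_addr="$HEAD_IP" --master_port=29500 \
    train.py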

See Distributed Multi-Node Jobs for more details.

Using CUDA_VISIBLE_DEVICES#

The environment variable CUDA_VISIBLE_DEVICES will be automatically set to the devices allocated to each task on each node. This variable is set when a task’s run commands are invoked.

For example, the task.yaml above requests 4 H100s on each of 2 nodes, and each node in the cluster has 8 GPUs, so the task's run commands will be invoked on every node with CUDA_VISIBLE_DEVICES populated with the 4 allocated device IDs.
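A quick way to see this in action is to print the variable from the task's run commands; the python invocation below is only a placeholder.

run: |
  # Prints the device IDs allocated to this task on this node (4 IDs in this example).
  echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
  python train.py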

If your run commands use Docker (docker run), simply pass --gpus=all; the correct environment variable will be set inside the container (only the allocated device IDs will be set).
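For example, a run section that wraps the training command in a container might look like the sketch below (the image name is a placeholder); per the note above, only the allocated device IDs will be visible inside the container.

run: |
  docker run --gpus=all --rm my-training-image python train.py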

Example: Fractional GPUs#

To run multiple trials per GPU, use fractional GPUs in the resource requirement. For example, use --gpus H100:0.5 to make 2 trials share 1 GPU:

$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 3e-3
...
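The same fractional requirement can also go in a task YAML's resources field; a minimal sketch (the run command is a placeholder):

resources:
  accelerators: H100:0.5

run: |
  python train.py --lr 1e-3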

When sharing a GPU, ensure that the GPU’s memory is not oversubscribed (otherwise, out-of-memory errors could occur).

Scheduling behavior#

SkyPilot’s scheduler serves two goals:

  1. Preventing resource oversubscription: SkyPilot schedules jobs on a cluster using their resource requirements—either specified in a task YAML’s resources field, or via the --gpus option of the sky exec CLI command. SkyPilot honors these resource requirements while ensuring that no resource in the cluster is oversubscribed. For example, if a node has 4 GPUs, it cannot host a combination of tasks whose sum of GPU requirements exceeds 4.

  2. Minimizing resource idleness: If a resource is idle, SkyPilot will schedule a queued job that can utilize that resource.

We illustrate the scheduling behavior by revisiting Tutorial: AI Training. In that tutorial, we have a task YAML that specifies these resource requirements:

# dnn.yaml
...
resources:
  accelerators: H100:4
...

Since a new cluster was created when we ran sky launch -c lm-cluster dnn.yaml, SkyPilot provisioned the cluster with exactly the same resources as those required for the task. Thus, lm-cluster has 4 H100 GPUs.

While this initial job is running, let us submit more tasks:

$ # Launch 4 jobs, perhaps with different hyperparameters.
$ # You can override the task name with `-n` (optional) and
$ # the resource requirement with `--gpus` (optional).
$ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=H100:4
$ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=H100:2

Because the cluster has only 4 H100 GPUs, we will see the following sequence of events:

  • The initial sky launch job is running and occupies 4 GPUs; all other jobs are pending (no free GPUs).

  • The first two sky exec jobs (job2, job3) start running once the initial job finishes and frees its GPUs, occupying 1 GPU each.

  • The third job (job4) will be pending, since it requires 4 GPUs and there are only 2 free GPUs left.

  • The fourth job (job5) will start running, since its requirement is fulfilled by the 2 free GPUs.

  • Once all jobs other than job4 finish, the cluster’s 4 GPUs become free again and job4 will transition from pending to running.

Thus, we may see the following job statuses on this cluster:

$ sky queue lm-cluster

 ID  NAME         USER  SUBMITTED    STARTED     STATUS
 5   job5         user  10 mins ago  10 mins ago RUNNING
 4   job4         user  10 mins ago  -           PENDING
 3   job3         user  10 mins ago  9 mins ago  RUNNING
 2   job2         user  10 mins ago  9 mins ago  RUNNING
 1   huggingface  user  10 mins ago  1 min ago   SUCCEEDED