Cluster Job Queue#
SkyPilot’s job queue allows multiple jobs to be scheduled on a cluster.
Getting started#
Each task submitted by sky exec is automatically queued and scheduled for execution on an existing cluster:
# Launch the job 5 times.
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
sky exec mycluster task.yaml -d
The -d / --detach flag detaches logging from the terminal, which is useful for launching many long-running jobs concurrently.
To show a cluster’s jobs and their statuses:
# Show a cluster's jobs (job IDs, statuses).
sky queue mycluster
To show the output for each job:
# Stream the outputs of a job.
sky logs mycluster JOB_ID
To cancel a job:
# Cancel a job.
sky cancel mycluster JOB_ID
# Cancel all jobs on a cluster.
sky cancel mycluster --all
Multi-node jobs#
Jobs that run on multiple nodes are also supported by the job queue.
First, create a cluster.yaml to specify the desired cluster:
num_nodes: 4
resources:
  accelerators: H100:8
workdir: ...
setup: |
  # Install dependencies.
  ...
Use sky launch -c mycluster cluster.yaml to provision a 4-node cluster with 8 H100 GPUs on each node.
The num_nodes field specifies how many nodes are required.
Next, create a task.yaml to specify each task:
num_nodes: 2
resources:
  accelerators: H100:4
run: |
  # Run training script.
  ...
This specifies a task that needs to be run on 2 nodes, each of which must have 4 free H100s.
Use sky exec mycluster task.yaml to submit this task. The job queue starts it once 2 nodes each have 4 free H100s.
See Distributed Multi-Node Jobs for more details.
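As a rough capacity check for the cluster and task above, the sketch below estimates how many copies of a multi-node task could run at the same time. It is only an upper bound under the assumption that GPUs are the sole constrained resource (CPU and memory are ignored); max_concurrent is a hypothetical helper, not part of SkyPilot.

```shell
# Hypothetical helper (not a SkyPilot command): upper bound on concurrent copies
# of a task, assuming GPUs are the only constrained resource.
# Usage: max_concurrent CLUSTER_NODES GPUS_PER_NODE TASK_NODES TASK_GPUS_PER_NODE
# All arguments must be positive integers.
max_concurrent() {
  cluster_nodes="$1"; gpus_per_node="$2"; task_nodes="$3"; task_gpus="$4"
  # A task that needs more nodes than the cluster has can never run.
  if [ "$task_nodes" -gt "$cluster_nodes" ]; then
    echo 0
    return
  fi
  # How many per-node shares of the task fit on a single node.
  slots_per_node=$((gpus_per_node / task_gpus))
  # Each copy takes one slot on each of task_nodes distinct nodes.
  echo $((cluster_nodes * slots_per_node / task_nodes))
}

# 4 nodes x 8 GPUs, task needs 2 nodes x 4 GPUs.
max_concurrent 4 8 2 4
```

For the cluster.yaml and task.yaml above this prints 4; further submissions of the same task would wait in the queue.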
Using CUDA_VISIBLE_DEVICES#
The environment variable CUDA_VISIBLE_DEVICES will be automatically set to the devices allocated to each task on each node. This variable is set when a task’s run commands are invoked.
For example, task.yaml above launches a 4-GPU task on each node that has 8 GPUs, so the task’s run commands will be invoked with CUDA_VISIBLE_DEVICES populated with 4 device IDs.
If your run commands use Docker/docker run, simply pass --gpus=all; the correct environment variable will be set inside the container (only the allocated device IDs will be set).
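Inside a task’s run commands, the allocation can be inspected before the workload starts. The snippet below is a minimal sketch: count_visible_gpus is a hypothetical helper that counts the comma-separated IDs in CUDA_VISIBLE_DEVICES, and it assumes the variable holds a plain comma-separated list.

```shell
# Hypothetical helper: count the device IDs listed in CUDA_VISIBLE_DEVICES
# (assumed to be a comma-separated list such as "0,1,2,3").
count_visible_gpus() {
  devices="${CUDA_VISIBLE_DEVICES:-}"
  # An unset or empty variable means no devices were listed.
  if [ -z "$devices" ]; then
    echo 0
    return
  fi
  # Put each ID on its own line, then count the lines.
  printf '%s\n' "$devices" | tr ',' '\n' | wc -l | tr -d ' '
}

echo "Allocated GPUs: $(count_visible_gpus) (IDs: ${CUDA_VISIBLE_DEVICES:-none})"
```

The count can then be passed to whatever launcher the training script uses, so the number of worker processes matches the number of allocated devices.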
Example: Grid Search#
To submit multiple trials with different hyperparameters to a cluster:
$ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 3e-3
$ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-4
$ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-2
$ sky exec mycluster --gpus H100:1 -d -- python train.py --lr 1e-6
Options used:
--gpus: specify the resource requirement for each job.
-d / --detach: detach the run and logging from the terminal, allowing multiple trials to run concurrently.
If there are only 4 H100 GPUs on the cluster, SkyPilot will queue 1 job while the other 4 run in parallel. Once a job finishes, the next job will begin executing immediately. See below for more details on SkyPilot’s scheduling behavior.
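The five submissions above can also be generated with a loop. The sketch below wraps the same sky exec command in a hypothetical submit_grid function; the SKY variable is an assumption introduced here so that the loop defaults to a dry run that only prints each command.

```shell
# SKY is a hypothetical switch: by default the commands are only printed.
# Set SKY=sky to actually submit (assumes the sky CLI is installed and
# mycluster is already up).
SKY="${SKY:-echo sky}"

# Submit one detached 1-GPU trial per learning rate given as an argument.
submit_grid() {
  for lr in "$@"; do
    $SKY exec mycluster --gpus H100:1 -d -- python train.py --lr "$lr"
  done
}

submit_grid 1e-3 3e-3 1e-4 1e-2 1e-6
```

Running it once as a dry run makes it easy to review the exact commands before submitting them with SKY=sky.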
Tip
You can also use environment variables to set different arguments for each trial.
Example: Fractional GPUs#
To run multiple trials per GPU, use fractional GPUs in the resource requirement.
For example, use --gpus H100:0.5 to make 2 trials share 1 GPU:
$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 3e-3
...
When sharing a GPU, ensure that the GPU’s memory is not oversubscribed (otherwise, out-of-memory errors could occur).
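A quick arithmetic check can help pick a safe fraction. The sketch below uses a hypothetical fits_on_gpu helper; the memory figures are illustrative placeholders, not measurements, and should be replaced with the GPU's actual capacity and each trial's observed peak usage.

```shell
# Hypothetical helper: succeeds (exit status 0) if TRIALS processes that each
# peak at PER_TRIAL_MIB fit within GPU_MIB of GPU memory.
# Usage: fits_on_gpu GPU_MIB PER_TRIAL_MIB TRIALS
fits_on_gpu() {
  gpu_mib="$1"; per_trial_mib="$2"; trials="$3"
  [ $((per_trial_mib * trials)) -le "$gpu_mib" ]
}

# Illustrative numbers only: 81920 MiB of GPU memory, 30720 MiB per trial.
if fits_on_gpu 81920 30720 2; then
  echo "2 trials fit on one GPU"
else
  echo "2 trials would oversubscribe GPU memory"
fi
```

With these placeholder figures, 2 trials fit but 3 would not, which would make H100:0.5 the smallest safe fraction.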
Scheduling behavior#
SkyPilot’s scheduler serves two goals:
1. Preventing resource oversubscription: SkyPilot schedules jobs on a cluster using their resource requirements, specified either in a task YAML’s resources field or via the --gpus option of the sky exec CLI command. SkyPilot honors these resource requirements while ensuring that no resource in the cluster is oversubscribed. For example, if a node has 4 GPUs, it cannot host a combination of tasks whose sum of GPU requirements exceeds 4.
2. Minimizing resource idleness: If a resource is idle, SkyPilot will schedule a queued job that can utilize that resource.
We illustrate the scheduling behavior by revisiting Tutorial: AI Training. In that tutorial, we have a task YAML that specifies these resource requirements:
# dnn.yaml
...
resources:
  accelerators: H100:4
...
Since a new cluster was created when we ran sky launch -c lm-cluster dnn.yaml, SkyPilot provisioned the cluster with exactly the same resources as those required for the task. Thus, lm-cluster has 4 H100 GPUs.
While this initial job is running, let us submit more tasks:
$ # Launch 4 jobs, perhaps with different hyperparameters.
$ # You can override the task name with `-n` (optional) and
$ # the resource requirement with `--gpus` (optional).
$ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=H100:4
$ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=H100:2
Because the cluster has only 4 H100 GPUs, we will see the following sequence of events:
1. The initial sky launch job is running and occupies 4 GPUs; all other jobs are pending (no free GPUs).
2. When the initial job finishes, the first two sky exec jobs (job2, job3) start running and occupy 1 GPU each.
3. The third job (job4) will be pending, since it requires 4 GPUs and there are only 2 free GPUs left.
4. The fourth job (job5) will start running, since its requirement is fulfilled with the 2 free GPUs.
5. Once job2, job3, and job5 all finish, the cluster’s 4 GPUs become free again and job4 will transition from pending to running.
Thus, we may see the following job statuses on this cluster:
$ sky queue lm-cluster
ID  NAME         USER  SUBMITTED    STARTED      STATUS
5   job5         user  10 mins ago  10 mins ago  RUNNING
4   job4         user  10 mins ago  -            PENDING
3   job3         user  10 mins ago  9 mins ago   RUNNING
2   job2         user  10 mins ago  9 mins ago   RUNNING
1   huggingface  user  10 mins ago  1 min ago    SUCCEEDED
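The sequence of events above can be reproduced with a toy model. The sketch below is not SkyPilot's scheduler: schedule is a hypothetical function that walks the queue in submission order and starts every job whose whole-GPU request fits in the GPUs still free, leaving the rest pending. It handles integer GPU counts on a single node only.

```shell
# Toy model (not SkyPilot's implementation): start each queued job whose GPU
# request fits in the remaining free GPUs; otherwise leave it pending.
# Usage: schedule FREE_GPUS NAME:GPUS [NAME:GPUS ...]
schedule() {
  free="$1"; shift
  for job in "$@"; do
    name="${job%%:*}"   # text before the first colon
    need="${job##*:}"   # text after the last colon
    if [ "$need" -le "$free" ]; then
      free=$((free - need))
      echo "$name RUNNING"
    else
      echo "$name PENDING"
    fi
  done
}

# State right after the initial 4-GPU job finishes: 4 GPUs are free.
schedule 4 job2:1 job3:1 job4:4 job5:2
```

This prints job2, job3, and job5 as RUNNING and job4 as PENDING, matching the sky queue output above: job4 is skipped because only 2 GPUs remain, while job5 fits in exactly those 2.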