Cluster Jobs#

You can run jobs on an existing cluster; submitted jobs are automatically queued and scheduled.

This is ideal for interactive development: you can reuse an existing cluster and its setup across many job submissions.

Getting started#

Use sky exec to submit jobs to an existing cluster:

# Launch the job 5 times.
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d

The -d / --detach flag detaches logging from the terminal, which is useful for launching many long-running jobs concurrently.
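
For example, repeated submissions like the ones above can be written as a shell loop; the -n flag (optional) gives each job a distinguishable name (the names here are illustrative):

# Submit 5 detached jobs; each submission returns immediately.
for i in 1 2 3 4 5; do
  sky exec mycluster job.yaml -d -n "job$i"
done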

To show a cluster’s jobs and their statuses:

# Show a cluster's jobs (job IDs, statuses).
sky queue mycluster

To show the output for each job:

# Stream the outputs of a job.
sky logs mycluster JOB_ID

To cancel a job:

# Cancel a job.
sky cancel mycluster JOB_ID

# Cancel all jobs on a cluster.
sky cancel mycluster --all

Tip

The sky launch command performs many steps in one call, including submitting a job to either an existing or a newly provisioned cluster. See here.
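
For example, a single call reuses mycluster if it already exists (or provisions it first), runs its setup, and submits the job to the cluster's queue:

# Provision (if needed), set up, and submit the job in one call.
sky launch -c mycluster job.yaml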

Multi-node jobs#

Jobs that run on multiple nodes are also supported.

First, create a cluster.yaml to specify the desired cluster:

num_nodes: 4
resources:
  accelerators: H100:8

workdir: ...
setup: |
  # Install dependencies.
  ...

Use sky launch -c mycluster cluster.yaml to provision a 4-node cluster, where each node has 8 H100 GPUs. The num_nodes field specifies how many nodes are required.
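
For example:

# Provision the 4-node cluster defined above.
sky launch -c mycluster cluster.yaml

# Confirm that the cluster is up.
sky status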

Next, create a job.yaml to specify each job:

num_nodes: 2
resources:
  accelerators: H100:4

run: |
  # Run training script.
  ...

This specifies a job that needs to be run on 2 nodes, each of which must have 4 free H100s. You can then use sky exec mycluster job.yaml to submit this job.
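
Since each node has 8 H100s, several such 2-node jobs can run on the cluster concurrently. For example:

# Each job uses 4 of the 8 H100s on 2 of the 4 nodes, so both can run at once.
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d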

See Distributed Multi-Node Jobs for more details.

Using CUDA_VISIBLE_DEVICES#

The environment variable CUDA_VISIBLE_DEVICES will be automatically set to the devices allocated to each job on each node. This variable is set when a job’s run commands are invoked.

For example, the job.yaml above requests 4 GPUs per node on nodes that each have 8 GPUs, so the job's run commands will be invoked with CUDA_VISIBLE_DEVICES populated with 4 device IDs.
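
For example, the run commands can read the variable directly (train.py is a placeholder; the exact device IDs depend on which GPUs are free):

# Inside the job's run commands: prints the 4 allocated device IDs, e.g. "0,1,2,3".
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
python train.py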

If your run commands use Docker (docker run), simply pass --gpus=all; the correct environment variable will be set inside the container (only the allocated device IDs will be set).
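
For example, a run command along these lines (the image and script names are placeholders):

# With --gpus=all, CUDA_VISIBLE_DEVICES inside the container contains only the allocated IDs.
docker run --gpus=all my-image python train.py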

Example: Fractional GPUs#

To run multiple trials per GPU, use fractional GPUs in the resource requirement. For example, use --gpus H100:0.5 to make 2 trials share 1 GPU:

$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 3e-3
...

When sharing a GPU, ensure that the GPU’s memory is not oversubscribed (otherwise, out-of-memory errors could occur).
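
For example, a sweep of fractional-GPU trials can be submitted in a loop (the script and learning rates are illustrative):

# Four 0.5-GPU trials occupy 2 H100s in total.
for lr in 1e-4 3e-4 1e-3 3e-3; do
  sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr $lr
done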

Scheduling behavior#

SkyPilot’s scheduler serves two goals:

  1. Preventing resource oversubscription: SkyPilot schedules jobs on a cluster using their resource requirements—either specified in a job YAML’s resources field, or via the --gpus option of the sky exec CLI command. SkyPilot honors these resource requirements while ensuring that no resource in the cluster is oversubscribed. For example, if a node has 4 GPUs, it cannot host a combination of jobs whose sum of GPU requirements exceeds 4.

  2. Minimizing resource idleness: If a resource is idle, SkyPilot will schedule a queued job that can utilize that resource.

We illustrate the scheduling behavior by revisiting Tutorial: AI Training. In that tutorial, we have a job YAML that specifies these resource requirements:

# dnn.yaml
...
resources:
  accelerators: H100:4
...

Since a new cluster was created when we ran sky launch -c lm-cluster dnn.yaml, SkyPilot provisioned the cluster with exactly the same resources as those required for the job. Thus, lm-cluster has 4 H100 GPUs.

While this initial job is running, let us submit more jobs:

$ # Launch 4 jobs, perhaps with different hyperparameters.
$ # You can override the job name with `-n` (optional) and
$ # the resource requirement with `--gpus` (optional).
$ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=H100:4
$ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=H100:2

Because the cluster has only 4 H100 GPUs, we will see the following sequence of events:

  • The initial sky launch job is running and occupies 4 GPUs; all other jobs are pending (no free GPUs).

  • After the initial job finishes, the first two sky exec jobs (job2, job3) start running and occupy 1 GPU each.

  • The third job (job4) will be pending, since it requires 4 GPUs and there are only 2 free GPUs left.

  • The fourth job (job5) will start running, since its requirement is fulfilled with the 2 free GPUs.

  • Once the remaining running jobs finish, the cluster’s 4 GPUs become free again and job4 will transition from pending to running.

Thus, we may see the following job statuses on this cluster:

$ sky queue lm-cluster

 ID  NAME         USER  SUBMITTED    STARTED      STATUS
 5   job5         user  10 mins ago  10 mins ago  RUNNING
 4   job4         user  10 mins ago  -            PENDING
 3   job3         user  10 mins ago  9 mins ago   RUNNING
 2   job2         user  10 mins ago  9 mins ago   RUNNING
 1   huggingface  user  10 mins ago  1 min ago    SUCCEEDED