Cluster Jobs#

You can run jobs on an existing cluster; submitted jobs are automatically queued and scheduled.

This is ideal for interactive development: you can reuse an existing cluster and its setup across many job submissions.

Getting started#

Use sky exec to submit jobs to an existing cluster:

# Launch the job 5 times.
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d

The -d / --detach flag detaches logging from the terminal, which is useful for launching many long-running jobs concurrently.
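
For example, repeated submissions like the ones above can be written as a shell loop; the -n flag (optional) gives each job a distinguishable name (the names here are illustrative):

# Submit 5 detached jobs; each submission returns immediately.
for i in 1 2 3 4 5; do
  sky exec mycluster job.yaml -d -n "job$i"
done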

To show a cluster’s jobs and their statuses:

# Show a cluster's jobs (job IDs, statuses).
sky queue mycluster

To show the output for each job:

# Stream the outputs of a job.
sky logs mycluster JOB_ID

To cancel a job:

# Cancel a job.
sky cancel mycluster JOB_ID

# Cancel all jobs on a cluster.
sky cancel mycluster --all

Tip

The sky launch command performs many steps in one call, including submitting a job to either an existing or a newly provisioned cluster. See here.
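
For example, a single call reuses mycluster if it already exists (or provisions it first), runs its setup, and submits the job to the cluster's queue:

# Provision (if needed), set up, and submit the job in one call.
sky launch -c mycluster job.yaml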

Multi-node jobs#

Jobs that run on multiple nodes are also supported.

First, create a cluster.yaml to specify the desired cluster:

num_nodes: 4
resources:
  accelerators: H100:8

workdir: ...
setup: |
  # Install dependencies.
  ...

Use sky launch -c mycluster cluster.yaml to provision a 4-node cluster, where each node has 8 H100 GPUs. The num_nodes field specifies how many nodes are required.
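
For example:

# Provision the 4-node cluster defined above.
sky launch -c mycluster cluster.yaml

# Confirm that the cluster is up.
sky status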

Next, create a job.yaml to specify each job:

num_nodes: 2
resources:
  accelerators: H100:4

run: |
  # Run training script.
  ...

This specifies a job that needs to be run on 2 nodes, each of which must have 4 free H100s. You can then use sky exec mycluster job.yaml to submit this job.
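
Since each node has 8 H100s, several such 2-node jobs can run on the cluster concurrently. For example:

# Each job uses 4 of the 8 H100s on 2 of the 4 nodes, so both can run at once.
sky exec mycluster job.yaml -d
sky exec mycluster job.yaml -d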

See Distributed Multi-Node Jobs for more details.

Using CUDA_VISIBLE_DEVICES#

The environment variable CUDA_VISIBLE_DEVICES will be automatically set to the devices allocated to each job on each node. This variable is set when a job’s run commands are invoked.

For example, the job.yaml above requests 4 GPUs per node on nodes that each have 8 GPUs, so the job's run commands will be invoked with CUDA_VISIBLE_DEVICES populated with 4 device IDs.
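
For example, the run commands can read the variable directly (train.py is a placeholder; the exact device IDs depend on which GPUs are free):

# Inside the job's run commands: prints the 4 allocated device IDs, e.g. "0,1,2,3".
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
python train.py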

If your run commands use Docker (docker run), simply pass --gpus=all; the correct environment variable will be set inside the container (only the allocated device IDs will be set).
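
For example, a run command along these lines (the image and script names are placeholders):

# With --gpus=all, CUDA_VISIBLE_DEVICES inside the container contains only the allocated IDs.
docker run --gpus=all my-image python train.py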

Example: Fractional GPUs#

To run multiple trials per GPU, use fractional GPUs in the resource requirement. For example, use --gpus H100:0.5 to make 2 trials share 1 GPU:

$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 1e-3
$ sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr 3e-3
...

When sharing a GPU, ensure that the GPU’s memory is not oversubscribed (otherwise, out-of-memory errors could occur).
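
For example, a sweep of fractional-GPU trials can be submitted in a loop (the script and learning rates are illustrative):

# Four 0.5-GPU trials occupy 2 H100s in total.
for lr in 1e-4 3e-4 1e-3 3e-3; do
  sky exec mycluster --gpus H100:0.5 -d -- python train.py --lr $lr
done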

Scheduling behavior#

SkyPilot’s scheduler serves two goals:

  1. Preventing resource oversubscription: SkyPilot schedules jobs on a cluster using their resource requirements—either specified in a job YAML’s resources field, or via the --gpus option of the sky exec CLI command. SkyPilot honors these resource requirements while ensuring that no resource in the cluster is oversubscribed. For example, if a node has 4 GPUs, it cannot host a combination of jobs whose sum of GPU requirements exceeds 4.

  2. Minimizing resource idleness: If a resource is idle, SkyPilot will schedule a queued job that can utilize that resource.

We illustrate the scheduling behavior by revisiting Tutorial: AI Training. In that tutorial, we have a job YAML that specifies these resource requirements:

# dnn.yaml
...
resources:
  accelerators: H100:4
...

Since a new cluster was created when we ran sky launch -c lm-cluster dnn.yaml, SkyPilot provisioned the cluster with exactly the same resources as those required for the job. Thus, lm-cluster has 4 H100 GPUs.

While this initial job is running, let us submit more jobs:

$ # Launch 4 jobs, perhaps with different hyperparameters.
$ # You can override the job name with `-n` (optional) and
$ # the resource requirement with `--gpus` (optional).
$ sky exec lm-cluster dnn.yaml -d -n job2 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job3 --gpus=H100:1
$ sky exec lm-cluster dnn.yaml -d -n job4 --gpus=H100:4
$ sky exec lm-cluster dnn.yaml -d -n job5 --gpus=H100:2

Because the cluster has only 4 H100 GPUs, we will see the following sequence of events:

  • The initial sky launch job is running and occupies 4 GPUs; all other jobs are pending (no free GPUs).

  • After the initial job finishes, the first two sky exec jobs (job2, job3) start running and occupy 1 GPU each.

  • The third job (job4) will be pending, since it requires 4 GPUs and there are only 2 free GPUs left.

  • The fourth job (job5) will start running, since its requirement is fulfilled with the 2 free GPUs.

  • Once the remaining running jobs finish, the cluster’s 4 GPUs become free again and job4 will transition from pending to running.

Thus, we may see the following job statuses on this cluster:

$ sky queue lm-cluster

 ID  NAME         USER  SUBMITTED    STARTED      STATUS
 5   job5         user  10 mins ago  10 mins ago  RUNNING
 4   job4         user  10 mins ago  -            PENDING
 3   job3         user  10 mins ago  9 mins ago   RUNNING
 2   job2         user  10 mins ago  9 mins ago   RUNNING
 1   huggingface  user  10 mins ago  1 min ago    SUCCEEDED