Managed Jobs#
Tip
This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.
SkyPilot supports managed jobs (sky jobs), which can automatically retry failures, recover from spot instance preemptions, and clean up when done.
To start a managed job, use sky jobs launch:
$ sky jobs launch -n myjob hello_sky.yaml
Task from YAML spec: hello_sky.yaml
Managed job 'myjob' will be launched on (estimated):
Considered resources (1 node):
------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
------------------------------------------------------------------------------------------
AWS m6i.2xlarge 8 32 - us-east-1 0.38 ✔
------------------------------------------------------------------------------------------
Launching a managed job 'myjob'. Proceed? [Y/n]: Y
... <job is submitted and launched>
(setup pid=2383) Running setup.
(myjob, pid=2383) Hello, SkyPilot!
✓ Managed job finished: 1 (status: SUCCEEDED).
Managed Job ID: 1
📋 Useful Commands
├── To cancel the job: sky jobs cancel 1
├── To stream job logs: sky jobs logs 1
├── To stream controller logs: sky jobs logs --controller 1
├── To view all managed jobs: sky jobs queue
└── To view managed job dashboard: sky jobs dashboard
The job is managed end-to-end and resources are automatically cleaned up.
Managed jobs have several benefits:
Use spot instances: Jobs can run on auto-recovering spot instances. This saves significant costs (e.g., ~70% for GPU VMs) by making preemptible spot instances usable for long-running jobs.
Recover from failure: When a job fails, it can be automatically retried on a new cluster, mitigating transient failures.
Managed pipelines: Run pipelines that contain multiple tasks (which can have different resource requirements and setup/run commands). Useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it.
Managed spot jobs#
Managed jobs can run on spot instances, and preemptions are auto-recovered by SkyPilot.
To launch a managed spot job, use sky jobs launch --use-spot.
SkyPilot automatically finds available spot instances across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.
Here is an example of a BERT training job failing over different regions across AWS and GCP.


To use managed spot jobs, there are two requirements:
Job YAML: Managed spot requires a YAML describing the job, tested with sky launch.
Checkpointing (optional): To recover from preemptions, the application code can periodically checkpoint its progress to a mounted cloud bucket and reload the latest checkpoint when restarted.
Quick comparison between managed spot jobs vs. launching unmanaged spot clusters:

Command | Managed? | SSH-able? | Best for
---|---|---|---
sky jobs launch --use-spot | Yes, preemptions are auto-recovered | No | Scaling out long-running jobs (e.g., data processing, training, batch inference)
sky launch --use-spot | No, preemptions are not handled | Yes | Interactive dev on spot instances (especially for hardware with low preemption rates)
Job YAML#
To launch a managed job, you can simply reuse your job YAML (recommended to test it with sky launch first).
For example, we found the BERT fine-tuning YAML works with sky launch, and want to launch it as a SkyPilot managed spot job. We can launch it with the following:
$ git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
$ sky jobs launch -n bert-qa bert_qa.yaml
# bert_qa.yaml
name: bert-qa

resources:
  accelerators: V100:1
  use_spot: true  # Use spot instances to save cost.

envs:
  # Fill in your wandb key: copy from https://wandb.ai/authorize
  # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
  # to pass the key in the command line, during `sky jobs launch`.
  WANDB_API_KEY:

# Assume your working directory is under `~/transformers`.
workdir: ~/transformers

setup: |
  pip install -e .
  cd examples/pytorch/question-answering/
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  pip install wandb

run: |
  cd examples/pytorch/question-answering/
  python run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 50 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --report_to wandb \
    --output_dir /tmp/bert_qa/
Note
workdir and file mounts with local files are automatically uploaded to a cloud bucket. The bucket is created at job launch time and cleaned up after the job finishes.
SkyPilot will launch and start monitoring the job. When a spot preemption or any machine failure happens, SkyPilot will automatically search for resources across regions and clouds to re-launch the job.
In this example, the job is restarted from scratch after each preemption recovery. To resume the job from previous states, the user's application needs to implement checkpointing and recovery.
Checkpointing and Recovery#
To allow job recovery, a cloud bucket is typically needed to store the job's states (e.g., model checkpoints). Below is an example of mounting a bucket to /checkpoint.
file_mounts:
  /checkpoint:
    name: # NOTE: Fill in your bucket name
    mode: MOUNT
The MOUNT mode in SkyPilot bucket mounting ensures that checkpoints written to /checkpoint are automatically synced to a persistent bucket.
Note that the application code should save program checkpoints periodically and reload those states when the job is restarted.
This is typically achieved by reloading the latest checkpoint at the beginning of your program.
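This pattern can be sketched in framework-agnostic Python (a minimal illustration assuming the /checkpoint mount above; the helper names are our own, and real training code would save model/optimizer state with its framework's own serializer):

```python
import glob
import os
import pickle

CKPT_DIR = "/checkpoint"  # The bucket mounted via file_mounts above.

def save_checkpoint(state: dict, step: int, ckpt_dir: str = CKPT_DIR) -> str:
    """Periodically persist training state to the mounted bucket."""
    # Zero-pad the step so lexicographic sort matches numeric order.
    path = os.path.join(ckpt_dir, f"ckpt-{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    return path

def load_latest_checkpoint(ckpt_dir: str = CKPT_DIR):
    """Reload the newest checkpoint (if any) at program start, so a
    recovered job resumes instead of restarting from scratch."""
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt-*.pkl")))
    if not ckpts:
        return None  # Fresh start: no checkpoint yet.
    with open(ckpts[-1], "rb") as f:
        return pickle.load(f)
```

Calling load_latest_checkpoint() first thing in your program, and save_checkpoint() every N steps, is the entire contract SkyPilot relies on for resuming after a preemption.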
An End-to-End Example#
Below we show an example for fine-tuning a BERT model on a question-answering task with HuggingFace.
# bert_qa.yaml
name: bert-qa

resources:
  accelerators: V100:1
  use_spot: true  # Use spot instances to save cost.

file_mounts:
  /checkpoint:
    name: # NOTE: Fill in your bucket name
    mode: MOUNT

envs:
  # Fill in your wandb key: copy from https://wandb.ai/authorize
  # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
  # to pass the key in the command line, during `sky jobs launch`.
  WANDB_API_KEY:

# Assume your working directory is under `~/transformers`.
workdir: ~/transformers

setup: |
  pip install -e .
  cd examples/pytorch/question-answering/
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  pip install wandb

run: |
  cd examples/pytorch/question-answering/
  python run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 50 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --report_to wandb \
    --output_dir /checkpoint/bert_qa/ \
    --run_name $SKYPILOT_TASK_ID \
    --save_total_limit 10 \
    --save_steps 1000
As HuggingFace has built-in support for periodic checkpointing, we only need to pass the highlighted arguments to set the output directory and checkpointing frequency (see more on the HuggingFace API). You may also refer to another example here of periodic checkpointing with PyTorch.
We also set --run_name to $SKYPILOT_TASK_ID so that the logs for all recoveries of the same job are saved to the same run in Weights & Biases.
Note
The environment variable $SKYPILOT_TASK_ID (example: "sky-managed-2022-10-06-05-17-09-750781_bert-qa_8-0") can be used to identify the same job; it is kept identical across all recoveries of the job. It can be accessed in the task's run commands or directly in the program itself (e.g., read it via os.environ and pass it to Weights & Biases for tracking purposes in your training script). It is made available to the task whenever it is invoked.
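For example, a training script might read the variable roughly like this (the stable_run_name helper is our own name, and the wandb call is shown only as an illustrative comment):

```python
import os

def stable_run_name(default: str = "local-dev-run") -> str:
    """Return a run name that stays identical across job recoveries.

    SKYPILOT_TASK_ID is set by SkyPilot for every invocation of the task
    and is kept the same across recoveries, so experiment trackers group
    all recoveries of a job under one run. The default is a fallback for
    running the script outside SkyPilot.
    """
    return os.environ.get("SKYPILOT_TASK_ID", default)

# Illustrative usage with Weights & Biases:
#   import wandb
#   wandb.init(name=stable_run_name(), resume="allow")
```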
With the highlighted changes, the managed spot job can now resume training after preemption! We can enjoy the cost savings of spot instances without worrying about preemptions or losing progress.
$ sky jobs launch -n bert-qa bert_qa.yaml
Tip
Try copy-pasting this example and adapting it to your own job.
Real-World Examples#
Managed on-demand/reserved jobs#
The same sky jobs launch and YAML interfaces can run jobs on auto-recovering on-demand or reserved instances. This is useful for having SkyPilot monitor any underlying machine failures and transparently recover the job.
To do so, simply set use_spot: false in the resources section, or override it with --use-spot false on the CLI.
$ sky jobs launch -n bert-qa bert_qa.yaml --use-spot false
Tip
It is useful to think of sky jobs launch as a "serverless" managed job interface, while sky launch is a cluster interface (that you can launch tasks on, albeit not managed).
Either Spot or On-Demand/Reserved#
You can use any_of to specify either spot or on-demand/reserved instances as candidate resources for a job. See the documentation here for more details.
resources:
  accelerators: A100:8
  any_of:
    - use_spot: true
    - use_spot: false
In this example, SkyPilot will perform cost optimization to select which resource to use, which will almost certainly be spot instances. If spot instances are unavailable, SkyPilot will fall back to launching on-demand/reserved instances.
Job restarts on user code failure#
By default, SkyPilot tries to recover a job when its underlying cluster is preempted or fails. User code failures (non-zero exit codes) are not auto-recovered.
In some cases, you may want a job to automatically restart on its own failure, e.g., when a training job crashes due to an NVIDIA driver issue or NCCL timeouts. To enable this, set max_restarts_on_errors in resources.job_recovery in the job YAML file.
resources:
  accelerators: A100:8
  job_recovery:
    # Restart the job up to 3 times on user code errors.
    max_restarts_on_errors: 3
More advanced policies for resource selection, such as the Can't Be Late (NSDI '24) paper, may be supported in the future.
Running many parallel jobs#
For batch jobs such as data processing or hyperparameter sweeps, you can launch many jobs in parallel. See Many Parallel Jobs.
Useful CLIs#
Here are some commands for managed jobs. Check sky jobs --help and the CLI reference for more details.
See all managed jobs:
$ sky jobs queue
Fetching managed jobs...
Managed jobs:
ID NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS
2 roberta 1x [A100:8][Spot] 2 hrs ago 2h 47m 18s 2h 36m 18s 0 RUNNING
1 bert-qa 1x [V100:1][Spot] 4 hrs ago 4h 24m 26s 4h 17m 54s 0 RUNNING
Stream the logs of a running managed job:
$ sky jobs logs -n bert-qa # by name
$ sky jobs logs 2 # by job ID
Cancel a managed job:
$ sky jobs cancel -n bert-qa # by name
$ sky jobs cancel 2 # by job ID
Note
If any failure happens for a managed job, you can check sky jobs queue -a for a brief reason for the failure. For more details, check sky jobs logs --controller <job_id>.
Managed pipelines#
A pipeline is a managed job that contains a sequence of tasks running one after another.
This is useful for running a sequence of tasks that depend on each other, e.g., training a model and then running inference on it. Different tasks can have different resource requirements; using appropriate per-task resources saves costs, while SkyPilot keeps the burden of managing the tasks off the user.
Note
In other words, a managed job is either a single task or a pipeline of tasks. All managed jobs are submitted by sky jobs launch.
To run a pipeline, specify the sequence of tasks in a YAML file. Here is an example:
name: pipeline

---

name: train

resources:
  accelerators: V100:8
  any_of:
    - use_spot: true
    - use_spot: false

file_mounts:
  /checkpoint:
    name: train-eval # NOTE: Fill in your bucket name
    mode: MOUNT

setup: |
  echo setup for training

run: |
  echo run for training
  echo save checkpoints to /checkpoint

---

name: eval

resources:
  accelerators: T4:1
  use_spot: false

file_mounts:
  /checkpoint:
    name: train-eval # NOTE: Fill in your bucket name
    mode: MOUNT

setup: |
  echo setup for eval

run: |
  echo load trained model from /checkpoint
  echo eval model on test set
The YAML above defines a pipeline with two tasks. The first name: pipeline names the pipeline. The first task is named train and the second eval. The tasks are separated by a line with three dashes (---). Each task has its own resources, setup, and run sections. Tasks are executed sequentially.
To submit the pipeline, the same command sky jobs launch is used. The pipeline will be automatically launched and monitored by SkyPilot. You can check the status of the pipeline with sky jobs queue or sky jobs dashboard.
$ sky jobs launch -n pipeline pipeline.yaml
$ sky jobs queue
Fetching managed jobs...
Managed jobs
In progress jobs: 1 RECOVERING
ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS
8 pipeline - 50 mins ago 47m 45s - 1 RECOVERING
↳ 0 train 1x [V100:8][Spot|On-demand] 50 mins ago 47m 45s - 1 RECOVERING
↳ 1 eval 1x [T4:1] - - - 0 PENDING
Note
The $SKYPILOT_TASK_ID environment variable is also available in the run section of each task, and is unique for each task in the pipeline. For example, the $SKYPILOT_TASK_ID for the eval task above is: "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1".
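As an illustration of the ID layout (the format is inferred from the example strings above and is not a documented API), the trailing <job_id>-<task_index> can be split off like so:

```python
def parse_task_id(task_id: str):
    """Split a managed-job task ID into its components.

    Format inferred from the examples in this doc, e.g.
    'sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1';
    treat it as illustrative, not a stable API.
    """
    prefix, _, suffix = task_id.rpartition("_")    # suffix: '8-1'
    job_id, _, task_index = suffix.partition("-")  # job ID and task index
    return prefix, int(job_id), int(task_index)

print(parse_task_id("sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1"))
# -> ('sky-managed-2022-10-06-05-17-09-750781_pipeline_eval', 8, 1)
```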
Job dashboard#
Use sky jobs dashboard to open a dashboard to see all jobs:
$ sky jobs dashboard
This automatically opens a browser tab to show the dashboard:

The UI shows the same information as the CLI command sky jobs queue -a. The UI is especially useful when there are many in-progress jobs to monitor, for which the terminal-based CLI may need multiple pages to display.
Intermediate storage for files#
For managed jobs, SkyPilot requires an intermediate bucket to store files used in the task, such as local file mounts, temporary files, and the workdir.
If you do not configure a bucket, SkyPilot will automatically create a temporary bucket named skypilot-filemounts-{username}-{run_id} for each job launch, and will delete the bucket after the job completes.
Alternatively, you can pre-provision a bucket and use it as an intermediate store for files by setting jobs.bucket in ~/.sky/config.yaml:
# ~/.sky/config.yaml
jobs:
  bucket: s3://my-bucket  # Supports s3://, gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>
If you choose to specify a bucket, ensure that the bucket already exists and that you have the necessary permissions.
When using a pre-provisioned intermediate bucket with jobs.bucket, SkyPilot creates job-specific directories under the bucket root to store files. They are organized in the following structure:
# cloud bucket, s3://my-bucket/ for example
my-bucket/
├── job-15891b25/            # Job-specific directory
│   ├── local-file-mounts/   # Files from local file mounts
│   ├── tmp-files/           # Temporary files
│   └── workdir/             # Files from workdir
└── job-cae228be/            # Another job's directory
    ├── local-file-mounts/
    ├── tmp-files/
    └── workdir/
When using a custom bucket (jobs.bucket), the job-specific directories (e.g., job-15891b25/) created by SkyPilot are removed when the job completes.
Tip
Multiple users can share the same intermediate bucket. Each user's jobs will have their own unique job-specific directories, ensuring that files are kept separate and organized.
How it works: The jobs controller#
The jobs controller is a small on-demand CPU VM or pod running in the cloud that manages all jobs of a user. It is automatically launched when the first managed job is submitted, and it is autostopped after it has been idle for 10 minutes (i.e., after all managed jobs finish and no new managed job is submitted in that duration). Thus, no user action is needed to manage its lifecycle.
You can see the controller with sky status and refresh its status with the -r/--refresh flag.
While the cost of the jobs controller is negligible (~$0.25/hour when running and less than $0.004/hour when stopped), you can still tear it down manually with sky down <job-controller-name>, where <job-controller-name> can be found in the output of sky status.
Note
Tearing down the jobs controller removes all logs and status information for the finished managed jobs. It is only allowed when there are no in-progress managed jobs, to ensure no resources are leaked.
Customizing Jobs Controller Resources#
You may want to customize the resources of the jobs controller for several reasons:
Increasing the maximum number of jobs that can run concurrently, which is based on the instance size of the controller. (Default: 90; see best practices)
Using a lower-cost controller (if you have a low number of concurrent managed jobs).
Pinning the jobs controller to a specific location. (Default: cheapest location)
Changing the disk_size of the jobs controller to store more logs. (Default: 50GB)
To achieve the above, you can specify custom configs in ~/.sky/config.yaml with the following fields:
jobs:
  # NOTE: these settings only take effect for a new jobs controller, not if
  # you have an existing one.
  controller:
    resources:
      # All configs below are optional.
      # Specify the location of the jobs controller.
      cloud: gcp
      region: us-central1
      # Bump cpus to allow more managed jobs to be launched concurrently. (Default: 4+)
      cpus: 8+
      # Bump memory to allow more managed jobs to be running at once.
      # By default, it scales with CPU (8x).
      memory: 64+
      # Specify the disk_size in GB of the jobs controller.
      disk_size: 100
The resources field has the same spec as a normal SkyPilot job; see here.
Note
These settings will not take effect if you have an existing controller (either stopped or live). For them to take effect, tear down the existing controller first, which requires all in-progress jobs to finish or be canceled.
Best Practices for Scaling Up the Jobs Controller#
Tip
For managed jobs, it's highly recommended to use service accounts for cloud authentication, so that the jobs controller credentials do not expire.
The number of active jobs that the controller supports is based on the controller size. Two limits apply:
Actively launching job count: maxes out at 4 * vCPU count. A job counts towards this limit when it is first starting, launching instances, or recovering.
Running job count: maxes out at memory / 350MiB, up to a maximum of 2000 jobs.
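These two limits can be estimated with a quick back-of-the-envelope calculation (an approximation: actual usable memory on an instance is somewhat below its nominal size, so real running-job limits can be slightly lower):

```python
def controller_limits(vcpus: int, memory_gib: float):
    """Approximate jobs-controller limits from its instance size.

    - Actively launching jobs: 4 * vCPU count.
    - Running jobs: memory / 350 MiB, capped at 2000.
    """
    launching = 4 * vcpus
    running = min(int(memory_gib * 1024 // 350), 2000)
    return launching, running

# e.g., AWS r6i.32xlarge (128 vCPUs, 1024 GiB memory):
print(controller_limits(128, 1024))  # -> (512, 2000)
```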
For maximum parallelism, the following configuration is recommended:
jobs:
  controller:
    resources:
      # In our testing, aws > gcp > azure
      cloud: aws
      cpus: 128
      # Azure does not have 128+ CPU instances, so use 96 instead:
      # cpus: 96
      memory: 600+
      disk_size: 500
With this configuration, you'll get the following performance:
Cloud | Instance type | Launching jobs | Running jobs
---|---|---|---
AWS | r6i.32xlarge | 512 launches at once | 2000 running at once
GCP | n2-highmem-128 | 512 launches at once | 2000 running at once
Azure | Standard_E96s_v5 | 384 launches at once | 1930 running at once