Many Parallel Jobs#
SkyPilot allows you to easily run many jobs in parallel and manage them in a single system. This is useful for hyperparameter tuning sweeps, data processing, and other batch jobs.
This guide shows a typical workflow for running many jobs with SkyPilot.
Why Use SkyPilot to Run Many Jobs#
Unified: Use any or multiple of your own infrastructure (Kubernetes, cloud VMs, reservations, etc.).
Elastic: Scale up and down based on demand.
Cost-effective: Only pay for the cheapest resources.
Robust: Automatically recover jobs from failures.
Observable: Monitor and manage all jobs in a single pane of glass.
Write a YAML for One Job#
Before scaling up to many jobs, write a SkyPilot YAML for a single job first and ensure it runs correctly. This can save time by avoiding debugging many jobs at once.
Here is the same example YAML as in Tutorial: AI Training:
# train.yaml
name: huggingface

resources:
  accelerators: V100:4

setup: |
  set -e  # Exit if any command failed.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  set -e  # Exit if any command failed.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --max_steps 50 \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16
First, launch the job to check that it launches and runs correctly:
sky launch -c train train.yaml
If there is any error, you can fix the code and/or the YAML, and launch the job again on the same cluster:
# Cancel the latest job.
sky cancel train -y
# Run the job again on the same cluster.
sky launch -c train train.yaml
Sometimes, it may be more efficient to log into the cluster and interactively debug the job. You can do so by directly ssh’ing into the cluster or using VSCode’s remote ssh.
# Log into the cluster.
ssh train
Next, after confirming the job is working correctly, add (hyper)parameters to the job YAML so that all job variants can be specified.
1. Add Hyperparameters#
To launch jobs with different hyperparameters, add them as environment variables to the SkyPilot YAML, and make your main program read these environment variables:
Updated SkyPilot YAML: train-template.yaml
# train-template.yaml
name: huggingface

envs:
  LR: 2e-5
  MAX_STEPS: 50

resources:
  accelerators: V100:4

setup: |
  set -e  # Exit if any command failed.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  set -e  # Exit if any command failed.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate ${LR} \
    --max_steps ${MAX_STEPS} \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16
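In this template, the hyperparameters are passed to run_glue.py as command-line flags built from the environment variables. If your own training script instead reads the environment variables directly, a minimal sketch (the script name is hypothetical; only the standard library is used) could look like:

# read_hparams.py (hypothetical): read the hyperparameters injected via `envs` / `--env`.
import os

# Fall back to the template defaults if the variables are not overridden.
lr = float(os.environ.get('LR', '2e-5'))
max_steps = int(os.environ.get('MAX_STEPS', '50'))

print(f'Training with lr={lr}, max_steps={max_steps}')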
You can now use --env to launch a job with different hyperparameters:

sky launch -c train train-template.yaml \
  --env LR=1e-5 \
  --env MAX_STEPS=100
Alternatively, store the environment variable values in a dotenv file and use --env-file to launch:
# configs/job1
LR=1e-5
MAX_STEPS=100

sky launch -c train train-template.yaml \
  --env-file configs/job1
2. Logging Job Outputs#
When running many jobs, it is useful to log the outputs of all jobs. You can use tools like W&B for this purpose:
SkyPilot YAML with W&B: train-template.yaml
# train-template.yaml
name: huggingface

envs:
  LR: 2e-5
  MAX_STEPS: 50
  WANDB_API_KEY: # Empty field means this field is required when launching the job.

resources:
  accelerators: V100:4

setup: |
  set -e  # Exit if any command failed.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  pip install wandb

run: |
  set -e  # Exit if any command failed.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate ${LR} \
    --max_steps ${MAX_STEPS} \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16 \
    --report_to wandb
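Here, --report_to wandb relies on the Hugging Face Trainer's built-in W&B integration. If your own training script logs metrics to W&B directly, a minimal sketch (the project name and logged values are placeholders for illustration) might look like:

import os
import wandb

# WANDB_API_KEY is read from the environment, which SkyPilot populates from `envs`.
wandb.init(
    project='skypilot-sweep',  # placeholder project name
    config={
        'lr': float(os.environ.get('LR', '2e-5')),
        'max_steps': int(os.environ.get('MAX_STEPS', '50')),
    },
)

for step in range(10):
    wandb.log({'loss': 1.0 / (step + 1)})  # dummy metric for illustration

wandb.finish()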
You can now launch the job with the following command (WANDB_API_KEY should already exist in your local environment variables):
sky launch -c train train-template.yaml \
  --env-file configs/job1 \
  --env WANDB_API_KEY
Scale Out to Many Jobs#
With the above setup, you can now scale out to run many jobs in parallel. You can either use the SkyPilot CLI with many config files or use the SkyPilot Python API.
With CLI and Config Files#
You can run many jobs in parallel by (1) creating multiple config files and (2) submitting them as SkyPilot managed jobs.
First, create a config file for each job (for example, in a configs directory):
# configs/job-1
LR=1e-5
MAX_STEPS=100
# configs/job-2
LR=2e-5
MAX_STEPS=200
...
An example Python script to generate config files
import os

CONFIG_PATH = 'configs'
LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
MAX_STEPS_CANDIDATES = [100, 300, 1000]

os.makedirs(CONFIG_PATH, exist_ok=True)

job_idx = 1
for lr in LR_CANDIDATES:
    for max_steps in MAX_STEPS_CANDIDATES:
        config_file = f'{CONFIG_PATH}/job-{job_idx}'
        with open(config_file, 'w') as f:
            print(f'LR={lr}', file=f)
            print(f'MAX_STEPS={max_steps}', file=f)
        job_idx += 1
Then, submit all jobs by iterating over the config files and calling sky jobs launch on each:
for config_file in configs/*; do
  job_name=$(basename $config_file)
  # -y: yes to all prompts.
  # -d: detach from the job's logging, so the next job can be submitted
  #     without waiting for the previous job to finish.
  sky jobs launch -n train-$job_name -y -d train-template.yaml \
    --env-file $config_file \
    --env WANDB_API_KEY
done
Job statuses can be checked via sky jobs queue:
$ sky jobs queue
Fetching managed job statuses...
Managed jobs
In progress tasks: 10 RUNNING
ID  TASK  NAME         RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
10  -     train-job10  1x[V100:4]  5 mins ago  5m 5s          1m 12s        0            RUNNING
9   -     train-job9   1x[V100:4]  6 mins ago  6m 11s         2m 23s        0            RUNNING
8   -     train-job8   1x[V100:4]  7 mins ago  7m 15s         3m 31s        0            RUNNING
...
With Python API#
For more customized control over how job variants are generated, you can also use the SkyPilot Python API to launch the jobs.
import sky

LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
MAX_STEPS_CANDIDATES = [100, 300, 1000]

task = sky.Task.from_yaml('train-template.yaml')

job_idx = 1
for lr in LR_CANDIDATES:
    for max_steps in MAX_STEPS_CANDIDATES:
        task.update_envs({'LR': lr, 'MAX_STEPS': max_steps})
        sky.jobs.launch(
            task,
            name=f'train-job{job_idx}',
            detach_run=True,
            retry_until_up=True,
        )
        job_idx += 1