Source: llm/skyrl
SkyRL: Modular Full-Stack RL Training for LLMs#
SkyRL is a modular, performant reinforcement learning library for LLMs, designed for real‑world agentic workloads. Its modular design lets you swap or extend any component: add new environments, implement improvements such as asynchronous training, add heterogeneous hardware support, and more.
Why SkyPilot + SkyRL?#
SkyPilot makes RL training with SkyRL easy to run and scale cost-efficiently:
Run on any AI infrastructure, including Kubernetes and clouds
Zero setup: one command handles provisioning, environment setup, and launching the training job.
Quick Start#
Launch a multi‑node GRPO training job on GSM8K using the cheapest available GPUs:
export WANDB_API_KEY="xxx"
sky launch -c skyrl skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml --secret WANDB_API_KEY
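The values declared under envs in train.yaml (for example NUM_EPOCHS and LOGGER) can be overridden at launch time with SkyPilot's --env flag. For instance, a short smoke test that logs to stdout instead of W&B could look like this (the overridden values here are illustrative):
sky launch -c skyrl skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml --secret WANDB_API_KEY --env NUM_EPOCHS=1 --env LOGGER=console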
Monitor training progress:
sky logs skyrl
Logs of the training job
You can also view the job status in the SkyPilot Dashboard:
sky dashboard
Dashboard showing the status of the training job
If Weights & Biases (W&B) is configured, you can also monitor the training run from the W&B dashboard.
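When you are done, stop the cluster to release the GPUs, or tear it down entirely. Note that checkpoints written under ~/ckpts on the cluster do not survive a teardown:
sky stop skyrl
sky down skyrl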
Key Features#
Modular design: plug‑and‑play algorithms, environments, and hardware backends
Scales from a single GPU to multi‑node clusters via Ray + SkyPilot (see the launch example after this list)
Minimal boilerplate: add new environments quickly (often <100 LoC)
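For example, the same YAML can be scaled out at launch time without editing the file, since the run script recomputes the Ray cluster size and the number of inference engines from the SkyPilot-provided node and GPU counts. The GPU choice below is only illustrative, and you may still want to revisit the batch-size flags in the training command:
sky launch -c skyrl skyrl_train/examples/gsm8k/gsm8k-grpo-skypilot.yaml --secret WANDB_API_KEY --num-nodes 4 --gpus H100:8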
Learn More#
Included files#
train.yaml
# Run this from the repository root after cloning SkyRL.
resources:
  infra: aws # replace with the cloud (or Kubernetes context) you want to launch on
  accelerators: L40S:4 # 4x 48 GB GPUs; adjust to a GPU SKU that matches your quota
  memory: 64+ # each node gets at least 64 GB of memory
  ports: 6479 # expose the port used by the Ray head node (see `ray start --port 6479` below)
  # network_tier: best # optional: request faster networking; with multiple nodes, cross-node communication can become a bottleneck
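# Increase num_nodes to scale out; the run script below sizes the Ray cluster and
# the number of inference engines from SKYPILOT_NUM_NODES automatically.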
num_nodes: 2 # cluster size
# Set to a git repository
workdir:
  url: https://github.com/NovaSky-AI/SkyRL.git
  ref: main
# Set secrets
secrets:
  WANDB_API_KEY: null
envs:
  LOGGER: "wandb" # change to "console" to print to stdout
  INFERENCE_BACKEND: "vllm"
  # INFERENCE_BACKEND: "sglang"
  NUM_EPOCHS: 20
# Commands run on each node of the remote cluster to set up the environment (e.g., install dependencies).
setup: |
  cd skyrl-train
  uv venv --python 3.12 --seed
  source .venv/bin/activate
  uv sync --extra vllm
  uv pip install wandb
  uv run -- python examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
  # If the GSM8K data has already been prepared under $HOME/data/gsm8k, you can delete the line above.
# The task commands executed on the remote cluster.
# This script first starts the Ray cluster (different `ray start` commands run on the head and worker nodes),
# then launches the training entrypoint on the head node only (SKYPILOT_NODE_RANK == 0).
run: |
  set -euo pipefail
  cd skyrl-train
  source .venv/bin/activate
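  # Use a scratch directory under $HOME for temporary files.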
  TMP_DIR="$HOME/skyrl-tmp"
  mkdir -p "$TMP_DIR"
  export TMPDIR="$TMP_DIR"
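  # SKYPILOT_NODE_IPS has one node IP per line; the first line is the head node's IP.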
  read -r head_ip _ <<< "$SKYPILOT_NODE_IPS"
  DATA_DIR="$HOME/data/gsm8k"
  # Log in to Weights & Biases with the key provided via the WANDB_API_KEY secret.
  uv run -- python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"
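  # Poll `ray status` until the Ray cluster at the given address is reachable
  # (24 attempts, 5 seconds apart, i.e. up to ~2 minutes).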
  wait_for_ray() {
    local address=$1
    for _ in $(seq 1 24); do
      if ray status --address "$address" >/dev/null 2>&1; then
        return 0
      fi
      sleep 5
    done
    echo "Ray cluster at $address failed to become ready" >&2
    return 1
  }
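  # Ray's uv runtime-env hook: workers reuse the uv-managed environment that the
  # driver was launched with via `uv run`.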
  export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
  if [ "$SKYPILOT_NODE_RANK" = "0" ]; then
    if ! ray status --address 127.0.0.1:6479 >/dev/null 2>&1; then
      ray start --head --disable-usage-stats --port 6479
    fi
    wait_for_ray 127.0.0.1:6479
    # Ensure rollout GPUs match policy GPUs when colocating all models.
    # With colocate_all=true, we must have:
    #   policy_num_gpus_per_node * policy_num_nodes ==
    #     num_inference_engines * inference_engine_tensor_parallel_size * inference_engine_data_parallel_size
    # Given TP=1 and DP=1, set num_inference_engines to total policy GPUs across the cluster.
    TOTAL_POLICY_GPUS=$((SKYPILOT_NUM_GPUS_PER_NODE * SKYPILOT_NUM_NODES))
    uv run --isolated --extra "$INFERENCE_BACKEND" -m skyrl_train.entrypoints.main_base \
      data.train_data="['${DATA_DIR}/train.parquet']" \
      data.val_data="['${DATA_DIR}/validation.parquet']" \
      trainer.algorithm.advantage_estimator="grpo" \
      trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
      trainer.placement.colocate_all=true \
      trainer.strategy=fsdp2 \
      trainer.placement.policy_num_gpus_per_node="$SKYPILOT_NUM_GPUS_PER_NODE" \
      trainer.placement.ref_num_gpus_per_node="$SKYPILOT_NUM_GPUS_PER_NODE" \
      trainer.placement.ref_num_nodes="$SKYPILOT_NUM_NODES" \
      trainer.placement.policy_num_nodes="$SKYPILOT_NUM_NODES" \
      generator.num_inference_engines="$TOTAL_POLICY_GPUS" \
      generator.inference_engine_tensor_parallel_size=1 \
      trainer.epochs=${NUM_EPOCHS} \
      trainer.eval_batch_size=1024 \
      trainer.eval_before_train=true \
      trainer.eval_interval=5 \
      trainer.update_epochs_per_batch=1 \
      trainer.train_batch_size=1024 \
      trainer.policy_mini_batch_size=256 \
      trainer.micro_forward_batch_size_per_gpu=40 \
      trainer.micro_train_batch_size_per_gpu=40 \
      trainer.ckpt_interval=10 \
      trainer.max_prompt_length=512 \
      generator.sampling_params.max_generate_length=1024 \
      trainer.policy.optimizer_config.lr=1.0e-6 \
      trainer.algorithm.use_kl_loss=true \
      generator.backend="$INFERENCE_BACKEND" \
      generator.run_engines_locally=true \
      generator.weight_sync_backend=nccl \
      generator.async_engine=true \
      generator.batched=true \
      environment.env_class=gsm8k \
      generator.n_samples_per_prompt=5 \
      generator.gpu_memory_utilization=0.8 \
      trainer.logger="$LOGGER" \
      trainer.project_name="gsm8k" \
      trainer.run_name="gsm8k_test" \
      trainer.resume_mode=null \
      trainer.ckpt_path="$HOME/ckpts/gsm8k_1.5B_ckpt"
  else
    if ! ray status --address "$head_ip:6479" >/dev/null 2>&1; then
      ray start --address "$head_ip:6479" --disable-usage-stats
    fi
    wait_for_ray "$head_ip:6479"
  fi
  echo "Node setup and Ray start script finished for rank ${SKYPILOT_NODE_RANK}."