Source: llm/verl
Verl: State-of-the-art RL Training for LLMs#
Verl is the most popular open-source reinforcement learning framework for LLMs, supporting PPO, GRPO, and other algorithms.
Also see search-tooling/ and this blog for tool-augmented “search” workflows (Search-R1 style), including Google Search–backed inference and a Wikipedia FAISS retrieval service used for inference and training.
Why SkyPilot + Verl?#
SkyPilot makes RL training easy and cost-effective:
Get GPUs instantly across clouds and Kubernetes
3x cheaper with managed spot instances
Zero setup - handles distributed Ray clusters automatically
Quick Start#
Launch single-node PPO (math) or GRPO (code) training:
sky launch -c verl-ppo llm/verl/verl-ppo.yaml --secret WANDB_API_KEY --num-nodes 1 -y
sky launch -c verl-ppo llm/verl/verl-ppo.yaml --secret WANDB_API_KEY --secret HF_TOKEN --num-nodes 1 -y
sky launch -c verl-grpo llm/verl/verl-grpo.yaml --secret WANDB_API_KEY --num-nodes 1 -y
sky launch -c verl-grpo llm/verl/verl-grpo.yaml --secret WANDB_API_KEY --secret HF_TOKEN --num-nodes 1 -y
Launch a 2-node RLHF training job on the cheapest available GPUs:
sky launch -c verl llm/verl/multinode.yaml
Monitor training progress:
sky logs verl
Training logs showing PPO optimization progress with reward metrics
Access Ray dashboard:
sky status --endpoint 8280 verl
Ray dashboard showing real-time monitoring of distributed training across multiple nodes
Learn More#
Included files#
code/preprocess_rstar_coder.py
# Copyright 2025 MIT
"""
Preprocess rStar-Coder dataset.
"""
import argparse
import os
import datasets
from verl.utils.hdfs_io import copy
from verl.utils.hdfs_io import makedirs
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default=None)
    parser.add_argument("--hdfs_dir", default=None)
    parser.add_argument("--local_save_dir", default="~/data/rstar_coder")
    args = parser.parse_args()

    data_source = "microsoft/rStar-Coder"
    dataset = datasets.load_dataset(
        data_source,
        data_files="synthetic_sft/data-00000-of-00015.parquet",
        split="train",
        trust_remote_code=True)

    # Tag examples with the GSM8K data source: verl selects its rule-based
    # reward (answer after "####") by data_source.
    data_source = 'openai/gsm8k'

    # Split into train/test (90/10)
    split_dataset = dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split_dataset["train"]
    test_dataset = split_dataset["test"]

    instruction_following = 'Let\'s think step by step and output the final answer after "####".'

    def make_map_fn(split):

        def process_fn(doc, idx):
            question_raw = doc.get("question", "")
            question = question_raw + " " + instruction_following
            answer = doc.get("response") or doc.get("code", "")
            data = {
                "data_source": data_source,
                "prompt": [{
                    "role": "user",
                    "content": question
                }],
                "ability": "code",
                "reward_model": {
                    "style": "rule",
                    "ground_truth": answer
                },
                "extra_info": {
                    "split": split,
                    "index": idx,
                }
            }
            return data

        return process_fn

    train_dataset = train_dataset.map(function=make_map_fn("train"),
                                      with_indices=True)
    test_dataset = test_dataset.map(function=make_map_fn("test"),
                                    with_indices=True)

    hdfs_dir = args.hdfs_dir
    local_save_dir = args.local_dir
    if local_save_dir is not None:
        print(
            "Warning: Argument 'local_dir' is deprecated. Please use 'local_save_dir' instead."
        )
    else:
        local_save_dir = args.local_save_dir

    train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
    test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))

    if hdfs_dir is not None:
        makedirs(hdfs_dir)
        copy(src=local_save_dir, dst=hdfs_dir)
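To sanity-check the script's output, here is a minimal sketch (assuming it was run with the default --local_save_dir of ~/data/rstar_coder) that loads the train split back and prints the fields the trainer expects:

import os
import datasets

data_dir = os.path.expanduser("~/data/rstar_coder")
train = datasets.load_dataset(
    "parquet", data_files=os.path.join(data_dir, "train.parquet"), split="train")

example = train[0]
print(example["prompt"][0]["role"])       # "user"
print(example["reward_model"]["style"])   # "rule"
print(example["extra_info"]["split"])     # "train"
print(len(train), "training examples")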
multinode.yaml
# Multi-node distributed training with Verl (Volcano Engine Reinforcement Learning) framework.
#
# Verl is a flexible and efficient reinforcement learning framework designed for
# training large language models with RLHF (Reinforcement Learning from Human Feedback).
# This example demonstrates multi-node training using PPO on the GSM8K dataset.
#
# Prerequisites:
# - GPU nodes with at least 40GB memory (e.g., A100)
# - Access to Hugging Face models (Qwen/Qwen2.5-0.5B-Instruct in this example)
#
# Usage:
# # Launch a 2-node training cluster:
# $ sky launch -c verl-cluster examples/verl/multinode.yaml
#
# # Monitor the Ray dashboard (optional):
# $ sky status --endpoint 8280 verl-cluster
#
# # Stream logs:
# $ sky logs verl-cluster
#
# # Cleanup:
# $ sky down verl-cluster
name: verl-multinode-training
resources:
accelerators:
- A100:8
- A100-80GB:8
- H100:8 # H100s for faster training
# cloud: lambda # Optional: specify cloud provider
use_spot: false # Set to true to use spot instances with managed jobs
ports:
- 8280 # Ray dashboard port
num_nodes: 2 # Number of nodes for distributed training
# Environment variables
envs:
HF_HUB_ENABLE_HF_TRANSFER: "1"
TORCH_NCCL_AVOID_RECORD_STREAMS: "1"
# Change this to your own checkpoint bucket
CHECKPOINT_BUCKET_NAME: sky-verl-checkpoints
# Optional: Add your W&B API key for experiment tracking
WANDB_API_KEY: null # Pass with `--secret WANDB_API_KEY` in CLI
# Training configuration
MODEL_NAME: Qwen/Qwen2.5-0.5B-Instruct
TOTAL_EPOCHS: 3
ACTOR_LR: 1e-6
CRITIC_LR: 1e-5
# Mount cloud storage for checkpoints
file_mounts:
/checkpoints:
name: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT
# Optionally, specify a store to enforce the use of one of the backends below:
# r2/azure/gcs/s3/cos
# store: s3
setup: |
# Clone and setup Verl
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
# Create virtual environment and install dependencies
uv venv --seed
source .venv/bin/activate
# Install Verl and its dependencies (skip Megatron for this example)
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
uv pip install --no-deps -e .
uv pip install "ray[default]" # For Ray dashboard
# Pin uvloop to 0.21.0 to work around asyncio event loop bug
# See: https://github.com/volcengine/verl/issues/3806
uv pip install "uvloop==0.21.0"
run: |
# Set up distributed training environment
head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)
echo "Head IP: $head_ip"
echo "Number of nodes: $num_nodes"
cd verl
source .venv/bin/activate
# Create custom runtime environment configuration
cat > runtime_env_custom.yaml <<EOF
working_dir: ./
excludes: ["/.git/", "*.whl", "**/*.whl"]
env_vars:
TORCH_NCCL_AVOID_RECORD_STREAMS: "1"
CUDA_DEVICE_MAX_CONNECTIONS: "1"
HF_HUB_ENABLE_HF_TRANSFER: "1"
EOF
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
# Head node: prepare data, download model, start Ray head, and submit training job
echo "Setting up head node..."
# Install additional dependencies for data processing
uv pip install datasets transformers
# Prepare GSM8K dataset
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
# Download model to cache
python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')"
fi
# This script is only available on skypilot-nightly>=1.0.0.dev20251114
# If you are using an older version, you can copy and paste the script from:
# https://github.com/skypilot-org/skypilot/blob/master/sky_templates/ray/start_cluster
export RAY_HEAD_PORT=6385
export RAY_DASHBOARD_PORT=8280
export RAY_DASHBOARD_HOST=0.0.0.0
export RAY_DASHBOARD_AGENT_LISTEN_PORT=52366
export RAY_HEAD_IP_ADDRESS="$head_ip"
~/sky_templates/ray/start_cluster
# Head node: submit training job
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
export RAY_ADDRESS="http://localhost:8280"
echo "Submitting training job to Ray cluster..."
ray job submit --address="$RAY_ADDRESS" --working-dir=. \
--runtime-env=runtime_env_custom.yaml \
-- python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=128 \
data.max_prompt_length=512 \
data.max_response_length=256 \
actor_rollout_ref.model.path=$MODEL_NAME \
actor_rollout_ref.actor.optim.lr=$ACTOR_LR \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
critic.optim.lr=$CRITIC_LR \
critic.model.path=$MODEL_NAME \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.ppo_mini_batch_size=64 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.project_name=ppo_training \
trainer.experiment_name=qwen-2.5-0.5B \
trainer.val_before_train=False \
trainer.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
trainer.nnodes=$num_nodes \
trainer.default_local_dir=/checkpoints \
trainer.save_freq=10 \
trainer.test_freq=10 \
trainer.total_epochs=$TOTAL_EPOCHS \
trainer.logger=['console'] \
trainer.resume_mode=auto 2>&1 | tee verl_training.log
# To enable W&B logging:
# 1. Set WANDB_API_KEY in envs or pass via `--secret WANDB_API_KEY`
# 2. Change trainer.logger to: trainer.logger=['console','wandb']
# 3. Optionally adjust trainer.project_name / trainer.experiment_name above,
#    e.g. trainer.experiment_name='${SKYPILOT_CLUSTER_NAME}-${SKYPILOT_TASK_ID}'
fi
search-tooling/README.md
Search tooling for VERL
This folder contains SkyPilot YAMLs for training and inference with tool-augmented “search” workflows (Search-R1 style), using either:
a Google Search backend, or
a Wikipedia retrieval service (FAISS index).
See this blog for how the YAMLs are used for training a RL agent that can use Google search.
Inference (Google Search backend)
sky launch -c verl-infer-google llm/verl/search-tooling/verl-search-interaction-google-search.yaml \
--env MODEL_PATH=/checkpoints/hf_model \
--env GOOGLE_API_KEY=your_key_here \
--env GOOGLE_CSE_ID=your_cse_id_here \
-y
Inference (local Wikipedia retrieval on the same node)
sky launch -c verl-infer llm/verl/search-tooling/verl-search-interaction-infer.yaml \
--env MODEL_PATH=/checkpoints/hf_model \
-y
Retrieval service (CPU-only, for reuse across jobs)
sky serve up -n retrieval llm/verl/search-tooling/verl-search-interaction-retrieval.yaml --cpus 32+ --memory 256+ -y
sky serve status retrieval --endpoint 8000
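Once the service is READY, its /retrieve endpoint accepts a JSON payload with queries, topk, and return_scores (the same shape the trainer's health check sends). A minimal sketch for querying it, assuming the endpoint returned by the command above:

import requests

# Replace with the value printed by `sky serve status retrieval --endpoint 8000`.
ENDPOINT = "http://1.2.3.4:8000"

resp = requests.post(
    f"{ENDPOINT}/retrieve",
    json={"queries": ["who wrote on the origin of species"], "topk": 3, "return_scores": False},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())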
Training
Single-node training with retrieval running on the same node:
llm/verl/search-tooling/verl-search-interaction.yaml
Training that points to an external retrieval service:
llm/verl/search-tooling/verl-search-interaction-rl-trainer.yaml
search-tooling/verl-search-interaction-google-search.yaml
# Search Tool Interaction Inference (Google Search backend)
#
# This example demonstrates inference using Search-R1 with a search/retrieval tool.
# The model uses a Google Search–backed tool for answering questions that require external knowledge.
# Both the Google search server and inference run on the same node.
#
# Usage:
# sky launch -c verl-infer-google llm/verl/search-tooling/verl-search-interaction-google-search.yaml \
# --env MODEL_PATH=/checkpoints/hf_model \
# --env GOOGLE_API_KEY=your_key_here \
# --env GOOGLE_CSE_ID=your_cse_id_here \
# -y
#
# Requirements:
# - Single GPU for inference
# - Valid Google Programmable Search Engine (CSE) + API key
resources:
accelerators: H100:1
memory: 128+
ports:
- 8000 # Google search server
num_nodes: 1
envs:
MODEL_PATH: "" # Optional: Path to model checkpoint (defaults to base model)
GOOGLE_API_KEY: "" # Required: Google API key
GOOGLE_CSE_ID: "" # Required: Google Custom Search Engine ID
CHECKPOINT_BUCKET_NAME: verl-search-interaction-checkpoints
file_mounts:
/checkpoints:
name: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT
setup: |
set -e
echo "=== Search Tool Inference Setup (Google Search) ==="
# System dependencies
echo "Installing system dependencies..."
sudo apt update && sudo apt install -y iproute2 git
# Python environment
echo "Setting up Python virtual environment..."
uv venv --python 3.10 --seed
source .venv/bin/activate
echo "Installing PyTorch..."
uv pip install "torch==2.8.*" torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Clone VERL repository (if infer.py relies on its code / configs)
echo "Cloning VERL repository..."
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
git checkout v0.6.0
echo "Installing VERL + SGLang dependencies..."
uv pip install -v -e .
uv pip install wheel
uv pip install packaging
uv pip install -r ./requirements_sglang.txt
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
cd ..
# Clone Search-R1 for inference
echo "Cloning Search-R1 repository..."
rm -rf Search-R1
git clone https://github.com/PeterGriffinJin/Search-R1.git
# Install additional inference dependencies
cd Search-R1
if [ -f requirements.txt ]; then
echo "Installing Search-R1 requirements..."
uv pip install -r requirements.txt
fi
# Ensure Google API client is available (if not already pulled in)
uv pip install google-api-python-client
cd ..
echo "✓ Inference setup complete!"
run: |
set -e
echo "=== Search Tool Inference (Google Search backend) ==="
# Activate environment
source .venv/bin/activate
# Sanity check env vars
if [ -z "$GOOGLE_API_KEY" ] || [ -z "$GOOGLE_CSE_ID" ]; then
echo "ERROR: GOOGLE_API_KEY and GOOGLE_CSE_ID must be set via --env."
exit 1
fi
echo "Using GOOGLE_API_KEY: (set)"
echo "Using GOOGLE_CSE_ID: (set)"
# Start Google search server in background
cd ~/sky_workdir/Search-R1
echo "Starting Google search server on port 8000..."
python search_r1/search/google_search_server.py \
--api_key "$GOOGLE_API_KEY" \
--cse_id "$GOOGLE_CSE_ID" \
> google_search_server.log 2>&1 &
RETRIEVAL_PID=$!
echo "Google search server PID: $RETRIEVAL_PID"
# Give the server a moment to start
sleep 10
# (Optional) basic health check if the server exposes one
# curl -f http://127.0.0.1:8000/health || echo "Healthcheck failed (continuing anyway)"
# Run inference
echo "Running infer.py..."
if [ -n "$MODEL_PATH" ]; then
# If your infer.py supports a flag, use it; otherwise it may read MODEL_PATH from env.
python infer.py --model_path "$MODEL_PATH" || python infer.py
else
python infer.py
fi
echo "✓ Inference finished"
# Clean up search server (SkyPilot will tear down the node afterwards anyway)
if ps -p $RETRIEVAL_PID > /dev/null 2>&1; then
echo "Stopping Google search server..."
kill $RETRIEVAL_PID || true
fi
echo "=== Done ==="
search-tooling/verl-search-interaction-infer.yaml
# Search Tool Interaction Inference
#
# This example demonstrates inference using Search-R1 with a search/retrieval tool.
# The model uses a search tool for answering questions that require external knowledge.
# Both retrieval service and inference run on the same node.
#
# Usage:
# sky launch -c verl-infer llm/verl/search-tooling/verl-search-interaction-infer.yaml --env MODEL_PATH=/checkpoints/hf_model -y
#
# Requirements:
# - Single GPU for inference
# - Sufficient memory for retrieval index
resources:
accelerators: H100:1
memory: 128+
ports:
- 8000 # Retrieval service
num_nodes: 1
envs:
MODEL_PATH: "" # Optional: Path to model checkpoint (defaults to base model)
RETRIEVAL_TOPK: 3
RETRIEVER_NAME: e5
RETRIEVER_MODEL: intfloat/e5-base-v2
CHECKPOINT_BUCKET_NAME: verl-search-interaction-checkpoints
file_mounts:
/checkpoints:
name: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT
setup: |
set -e
echo "=== Search Tool Inference Setup ==="
# System dependencies
echo "Installing system dependencies..."
sudo apt update && sudo apt install -y iproute2
# Python environment
echo "Setting up Python virtual environment..."
uv venv --python 3.10 --seed
source .venv/bin/activate
# Install PyTorch
echo "Installing PyTorch..."
uv pip install "torch==2.8.*" torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Clone VERL repository and install it plus SGLang dependencies
echo "Cloning VERL repository..."
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
git checkout v0.6.0
echo "Installing VERL + SGLang dependencies..."
uv pip install -v -e .
uv pip install wheel
uv pip install packaging
uv pip install -r ./requirements_sglang.txt
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
cd ..
# Download Wikipedia corpus and FAISS index
echo "Downloading Wikipedia corpus and FAISS index..."
export save_path=~/dataset
mkdir -p $save_path
huggingface-cli download maknee/wiki-18-subsets wiki-18-100k.jsonl.gz --repo-type=dataset --local-dir $save_path
huggingface-cli download maknee/wiki-18-subsets e5_Flat-100k.index --repo-type=dataset --local-dir $save_path
# Move files to expected locations
mv $save_path/wiki-18-100k.jsonl.gz $save_path/wiki-18.jsonl.gz
mv $save_path/e5_Flat-100k.index $save_path/e5_Flat.index
# Decompress the JSONL file
gzip -d $save_path/wiki-18.jsonl.gz -f
# Clone Search-R1 for inference
echo "Cloning Search-R1 repository..."
rm -rf Search-R1
git clone https://github.com/PeterGriffinJin/Search-R1/
# Install additional inference dependencies if needed
cd Search-R1
if [ -f requirements.txt ]; then
uv pip install -r requirements.txt
fi
cd ..
echo "✓ Inference setup complete!"
run: |
set -e
echo "=== Search Tool Inference ==="
# Activate environment
source .venv/bin/activate
# Set up paths
save_path=~/dataset
index_file=$save_path/e5_Flat.index
corpus_file=$save_path/wiki-18.jsonl
# Start retrieval server in background
echo "Starting retrieval server on port 8000..."
cd verl
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py \
--index_path $index_file \
--corpus_path $corpus_file \
--topk $RETRIEVAL_TOPK \
--retriever_name $RETRIEVER_NAME \
--retriever_model $RETRIEVER_MODEL &
RETRIEVAL_PID=$!
sleep 10
# Run inference
cd ~/sky_workdir/Search-R1
python infer.py
search-tooling/verl-search-interaction-retrieval.yaml
# Search Tool Retrieval Service
#
# This service provides Wikipedia retrieval capabilities using FAISS indexing.
# It runs on CPU nodes and exposes a retrieval API on port 8000.
#
# Usage:
# sky launch -c retrieval llm/verl/search-tooling/verl-search-interaction-retrieval.yaml --cpus 32+ --memory 256+ -y
#
# Get endpoint:
# sky status retrieval --endpoint 8000
#
# OR with sky serve
# sky serve up -n retrieval llm/verl/search-tooling/verl-search-interaction-retrieval.yaml --cpus 32+ --memory 256+ -y
#
# Get endpoint:
# sky serve status retrieval --endpoint 8000
service:
readiness_probe: /
replicas: 3
resources:
cpus: 32+
memory: 256+
use_spot: false
ports:
- 8000 # Retrieval service API
num_nodes: 1
envs:
RETRIEVAL_TOPK: 3
RETRIEVER_NAME: e5
RETRIEVER_MODEL: intfloat/e5-base-v2
setup: |
set -e
echo "=== Retrieval Service Setup ==="
# System dependencies
echo "Installing system dependencies..."
sudo apt update && sudo apt install -y iproute2
# Python environment
echo "Setting up Python virtual environment..."
uv venv --python 3.10 --seed
source .venv/bin/activate
# Install retrieval service dependencies
echo "Installing retrieval service dependencies..."
uv pip install "torch==2.8.*" torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
uv pip install transformers datasets huggingface_hub
uv pip install faiss-cpu
uv pip install uvicorn fastapi uvloop==0.21.0
# Download Wikipedia corpus and FAISS index
echo "Downloading Wikipedia corpus and FAISS index..."
export save_path=~/dataset
mkdir -p $save_path
huggingface-cli download maknee/wiki-18-subsets wiki-18-100k.jsonl.gz --repo-type=dataset --local-dir $save_path
huggingface-cli download maknee/wiki-18-subsets e5_Flat-100k.index --repo-type=dataset --local-dir $save_path
# Move files to expected locations
mv $save_path/wiki-18-100k.jsonl.gz $save_path/wiki-18.jsonl.gz
mv $save_path/e5_Flat-100k.index $save_path/e5_Flat.index
# Decompress the JSONL file
gzip -d $save_path/wiki-18.jsonl.gz -f
# Clone VERL repository for retrieval server code
echo "Cloning repositories..."
git clone https://github.com/volcengine/verl.git
cd verl
git checkout v0.6.0
# Patch retrieval server for CPU-only usage (comment out CUDA calls)
echo "Patching retrieval server for CPU-only usage..."
sed -i 's/^\(\s*\)\(model\.cuda()\)/\1# \2 # Commented out for CPU-only deployment/' \
examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py
sed -i 's/^\(\s*\)\(inputs = {k: v\.cuda() for k, v in inputs\.items()}\)/\1# \2 # Commented out for CPU-only deployment/' \
examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py
cd ..
echo "✓ Retrieval service setup complete!"
run: |
set -e
echo "=== Starting Retrieval Service ==="
# Activate environment
source .venv/bin/activate
# Set up paths
save_path=~/dataset
index_file=$save_path/e5_Flat.index
corpus_file=$save_path/wiki-18.jsonl
# Start retrieval server
echo "Starting retrieval server on port 8000..."
cd verl
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py \
--index_path $index_file \
--corpus_path $corpus_file \
--topk $RETRIEVAL_TOPK \
--retriever_name $RETRIEVER_NAME \
--retriever_model $RETRIEVER_MODEL &
echo "✓ Retrieval service running on port 8000"
search-tooling/verl-search-interaction-rl-trainer.yaml
# Search Tool Interaction Training with VERL (RL Trainer)
#
# This example demonstrates multi-turn tool interaction training using VERL with a search/retrieval tool.
# The model learns to use a search tool for answering questions that require external knowledge.
#
# Requires a separate retrieval service running (see verl-search-interaction-retrieval.yaml)
#
# Based on: https://verl.readthedocs.io/en/v0.5.x/sglang_multiturn/search_tool_example.html
#
# Usage:
# # 1. Launch retrieval service first
# sky launch -c retrieval llm/verl/search-tooling/verl-search-interaction-retrieval.yaml --cpus 32+ --memory 256+ -y
#
# # 2. Get retrieval service endpoint
# RETRIEVAL_IP=$(sky status retrieval --endpoint 8000)
#
# # 3. Launch training (without WandB)
# sky launch -c verl-train llm/verl/search-tooling/verl-search-interaction-rl-trainer.yaml --env RETRIEVAL_SERVICE_URL=http://$RETRIEVAL_IP --env DATASET_SIZE=small --env TOTAL_EPOCHS=1 -y
#
# # Or with WandB logging (optional)
# sky launch -c verl-train llm/verl/search-tooling/verl-search-interaction-rl-trainer.yaml --env RETRIEVAL_SERVICE_URL=http://$RETRIEVAL_IP --env DATASET_SIZE=small --env TOTAL_EPOCHS=1 --secret WANDB_API_KEY -y
#
# Requirements:
# - Docker with SYS_PTRACE capability (for PyTorch multiprocessing CUDA tensor sharing)
# - H100 GPUs (can be adjusted for other accelerators)
# - Running retrieval service at RETRIEVAL_SERVICE_URL
resources:
accelerators: H100:1
memory: 128+
image_id: docker:verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
ports:
- 8265 # Ray dashboard
- 9090 # vLLM model serving
num_nodes: 1
config:
docker:
run_options:
- --cap-add=SYS_PTRACE # Required for PyTorch CUDA tensor sharing between Ray workers
- --ipc=host
- --shm-size=16g
envs:
RETRIEVAL_SERVICE_URL: "" # Required: URL of the retrieval service (e.g., http://retrieval-ip:8000)
DATASET_SIZE: small # Options: small (1000 train, 200 test), medium (10k train, 2k test), full
TOTAL_EPOCHS: 1
TOTAL_STEPS: 10
TRAIN_BATCH_SIZE: 512
VAL_BATCH_SIZE: 256
SAVE_FREQ: 5 # Save checkpoints every 5 steps
TEST_FREQ: 5 # Test every 5 steps
MODEL_NAME: Qwen/Qwen2.5-3B-Instruct
WANDB_PROJECT_NAME: search_r1_like_async_rl
WANDB_EXPERIMENT_NAME: qwen2.5-3b-it_rm-searchR1-like-sgl-multiturn
CHECKPOINT_BUCKET_NAME: nebius://verl-search-interaction-checkpoints
file_mounts:
/checkpoints:
source: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT_CACHED
secrets:
WANDB_API_KEY: "" # Optional: Set to enable WandB logging. If not set, only console logging will be used.
setup: |
rm -f ~/.pip/pip.conf
rm -f ~/.config/pip/pip.conf
set -e
echo "=== VERL Search Tool Interaction Training Setup ==="
# Validate required environment variables
if [ -z "$RETRIEVAL_SERVICE_URL" ]; then
echo "ERROR: RETRIEVAL_SERVICE_URL environment variable is required"
echo "Example: --env RETRIEVAL_SERVICE_URL=http://retrieval-ip:8000"
exit 1
fi
# Python environment
echo "Setting up Python virtual environment..."
uv venv --python 3.10 --seed
source .venv/bin/activate
# Clone VERL repository
echo "Cloning VERL repository..."
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
git checkout v0.6.0
# Core dependencies
echo "Installing PyTorch and VERL..."
uv pip install "torch==2.8.*" torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
uv pip install -v -e .
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
uv pip install wheel
uv pip install packaging
uv pip install -r ./requirements_sglang.txt
# Install uvloop (required version)
uv pip install uvloop==0.21.0
# Data preparation
echo "Preparing search R1 dataset..."
python3 examples/data_preprocess/preprocess_search_r1_dataset.py
# Clone Search-R1 for additional utilities
git clone https://github.com/PeterGriffinJin/Search-R1/
# Update tool config to use external retrieval service
echo "Configuring external retrieval service..."
TOOL_CONFIG="examples/sglang_multiturn/config/tool_config/search_tool_config.yaml"
# Backup original config
cp $TOOL_CONFIG ${TOOL_CONFIG}.bak
# Update retrieval URL and num_workers in the config
sed -i 's/num_workers: *120/num_workers: 8/' $TOOL_CONFIG
sed -i "s|http://127\.0\.0\.1:8000/retrieve|$RETRIEVAL_SERVICE_URL/retrieve|g" $TOOL_CONFIG
sed -i "s|http://localhost:8000|$RETRIEVAL_SERVICE_URL|g" $TOOL_CONFIG
echo "✓ Setup complete!"
echo "Dataset location: ~/data/searchR1_processed_direct/"
echo "VERL repository: $(pwd)"
echo "Retrieval service: $RETRIEVAL_SERVICE_URL"
run: |
set -e
echo "=== VERL Search Tool Interaction Training ==="
sudo apt update && sudo apt install -y iproute2 npm
# Validate retrieval service
if [ -z "$RETRIEVAL_SERVICE_URL" ]; then
echo "ERROR: RETRIEVAL_SERVICE_URL environment variable is required"
exit 1
fi
echo "Testing connection to retrieval service at $RETRIEVAL_SERVICE_URL..."
# Give it a few retries in case the service is still starting
max_retries=30
retry_count=0
while [ $retry_count -lt $max_retries ]; do
# Test the /retrieve endpoint with a sample query
test_response=$(curl -s -X POST "${RETRIEVAL_SERVICE_URL}/retrieve" \
-H "Content-Type: application/json" \
-d '{"queries": ["test query"], "topk": 1, "return_scores": false}' \
-w "\n%{http_code}" 2>&1)
http_code=$(echo "$test_response" | tail -n1)
if [ "$http_code" = "200" ]; then
echo "✓ Successfully connected to retrieval service"
echo "✓ /retrieve endpoint is responding correctly"
break
fi
retry_count=$((retry_count+1))
if [ $retry_count -eq $max_retries ]; then
echo "WARNING: Could not connect to retrieval service at $RETRIEVAL_SERVICE_URL"
echo "Make sure the retrieval service is running and accessible"
echo "Last response code: $http_code"
fi
sleep 5
done
# Multi-node setup
HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
NUM_NODES=$SKYPILOT_NUM_NODES
NUM_GPUS_PER_NODE=$SKYPILOT_NUM_GPUS_PER_NODE
# Network configuration for distributed training
NETWORK_INTERFACE=$(ip route get 8.8.8.8 | grep -oP 'dev \K\S+')
export GLOO_SOCKET_IFNAME=$NETWORK_INTERFACE
export NCCL_SOCKET_IFNAME=$NETWORK_INTERFACE
# PyTorch multiprocessing configuration
export TORCH_MULTIPROCESSING_SHARING_STRATEGY=file_system
# Activate environment
source .venv/bin/activate
# Set up paths
cd verl
PROJECT_DIR="$(pwd)"
export PYTHONPATH="$PROJECT_DIR:$PYTHONPATH"
# WandB login (optional)
if [ -n "$WANDB_API_KEY" ]; then
echo "Logging into Weights & Biases..."
python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"
fi
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
echo "Starting Ray head node on port 6379..."
ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265
# Wait for all nodes to connect
echo "Waiting for $NUM_NODES nodes to connect..."
retry_count=0
max_retries=30
while [ $retry_count -lt $max_retries ]; do
connected_nodes=$(ray status 2>/dev/null | grep -c "node_" || echo "0")
if [ "$connected_nodes" -ge "$NUM_NODES" ]; then
echo "✓ All $NUM_NODES nodes connected"
break
fi
retry_count=$((retry_count+1))
sleep 10
done
# Display Ray cluster status
echo "Ray cluster status:"
ray status
echo "Starting search tool interaction training..."
cd $PROJECT_DIR
# Increase file descriptor limit
ulimit -n 65535
# Set up configuration paths
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"
TRAIN_DATA="$HOME/data/searchR1_processed_direct/train.parquet"
VAL_DATA="$HOME/data/searchR1_processed_direct/test.parquet"
TOOL_CONFIG="$CONFIG_PATH/tool_config/search_tool_config.yaml"
# Configure logging based on WANDB_API_KEY availability
if [ -n "$WANDB_API_KEY" ]; then
LOGGER_CONFIG='["console","wandb"]'
WANDB_ARGS="trainer.project_name=$WANDB_PROJECT_NAME trainer.experiment_name=$WANDB_EXPERIMENT_NAME"
echo "✓ WandB logging enabled"
else
LOGGER_CONFIG='["console"]'
WANDB_ARGS=""
echo "ℹ WandB logging disabled (no API key provided)"
fi
# Training with search tool
python3 -m verl.trainer.main_ppo \
--config-path="$CONFIG_PATH" \
--config-name='search_multiturn_grpo' \
algorithm.adv_estimator=grpo \
data.train_batch_size=$TRAIN_BATCH_SIZE \
data.val_batch_size=$VAL_BATCH_SIZE \
data.max_prompt_length=4096 \
data.max_response_length=3000 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.return_raw_chat=True \
actor_rollout_ref.model.path=$MODEL_NAME \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.285 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.rollout.max_model_len=15000 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.multi_turn.max_assistant_turns=2 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.val_before_train=False \
trainer.logger="$LOGGER_CONFIG" \
$WANDB_ARGS \
trainer.n_gpus_per_node=$NUM_GPUS_PER_NODE \
trainer.nnodes=$NUM_NODES \
trainer.save_freq=$SAVE_FREQ \
trainer.test_freq=$TEST_FREQ \
data.train_files="$TRAIN_DATA" \
data.val_files="$VAL_DATA" \
actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
trainer.total_epochs=$TOTAL_EPOCHS \
trainer.total_training_steps=$TOTAL_STEPS \
trainer.default_local_dir=/checkpoints
echo "✓ Training complete!"
# Model checkpoint merging
echo "Merging model checkpoints..."
LATEST_STEP=$(cat /checkpoints/latest_checkpointed_iteration.txt)
CHECKPOINT_DIR="/checkpoints/global_step_${LATEST_STEP}/actor"
python -m verl.model_merger merge \
--backend fsdp \
--tie-word-embedding \
--local_dir ${CHECKPOINT_DIR} \
--target_dir /checkpoints/hf_model
echo "✓ Model saved to /checkpoints/hf_model"
echo "Training artifacts saved to cloud bucket: ${CHECKPOINT_BUCKET_NAME}"
else
# Worker node setup
echo "Worker node (rank $SKYPILOT_NODE_RANK) connecting to head at $HEAD_IP:6379..."
sleep 15
ps aux | grep ray | grep $HEAD_IP:6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
echo "✓ Worker node connected"
sleep infinity
fi
search-tooling/verl-search-interaction.yaml
# Search Tool Interaction Training with VERL
#
# This example demonstrates multi-turn tool interaction training using VERL with a search/retrieval tool.
# The model learns to use a search tool for answering questions that require external knowledge.
#
# Based on: https://verl.readthedocs.io/en/v0.5.x/sglang_multiturn/search_tool_example.html
#
# Usage:
# # Without WandB logging
# sky launch -c verl-search llm/verl/search-tooling/verl-search-interaction.yaml --env DATASET_SIZE=small --env TOTAL_EPOCHS=1 -y
#
# # Or with WandB logging (optional)
# sky launch -c verl-search llm/verl/search-tooling/verl-search-interaction.yaml --secret WANDB_API_KEY --env DATASET_SIZE=small --env TOTAL_EPOCHS=1 -y
#
# Requirements:
# - Docker with SYS_PTRACE capability (for PyTorch multiprocessing CUDA tensor sharing)
# - Single H100 or equivalent GPU (can be adjusted for other accelerators)
resources:
accelerators: H100:1
memory: 128+
image_id: docker:verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
ports:
- 8265 # Ray dashboard
- 8000 # Retrieval service
num_nodes: 1
config:
docker:
run_options:
- --cap-add=SYS_PTRACE # Required for PyTorch CUDA tensor sharing between Ray workers
- --ipc=host
- --shm-size=16g
envs:
DATASET_SIZE: small # Options: small (1000 train, 200 test), medium (10k train, 2k test), full
TOTAL_EPOCHS: 1
TOTAL_STEPS: 10
TRAIN_BATCH_SIZE: 512
VAL_BATCH_SIZE: 256
SAVE_FREQ: 5 # Save checkpoints every 5 steps
TEST_FREQ: 5 # Test every 5 steps
MODEL_NAME: Qwen/Qwen2.5-3B-Instruct
WANDB_PROJECT_NAME: search_r1_like_async_rl
WANDB_EXPERIMENT_NAME: qwen2.5-3b-it_rm-searchR1-like-sgl-multiturn
CHECKPOINT_BUCKET_NAME: verl-search-interaction-checkpoints
file_mounts:
/checkpoints:
name: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT
secrets:
WANDB_API_KEY: "" # Optional: Set to enable WandB logging. If not set, only console logging will be used.
setup: |
rm -f ~/.pip/pip.conf
rm -f ~/.config/pip/pip.conf
set -e
echo "=== VERL Search Tool Interaction Setup ==="
# System dependencies
echo "Installing system dependencies..."
sudo apt update && sudo apt install -y iproute2 npm
# Optional: Install AI CLI tools
npm i -g @anthropic-ai/claude-code -y
npm i -g @openai/codex -y
npm i -g @google/gemini-cli -y
# export IS_SANDBOX=1
# echo 'alias cx="codex --dangerously-bypass-approvals-and-sandbox --enable web_search_request"' >> ~/.bashrc
# echo 'alias ccd="claude --dangerously-skip-permissions"' >> ~/.bashrc
# echo 'alias cxh="codex -m gpt-5 -c model_reasoning_effort="high" --dangerously-bypass-approvals-and-sandbox --enable web_search_request"' >> ~/.bashrc
# echo 'alias gmi="gemini --telemetry false --yolo"' >> ~/.bashrc
# claude mcp add codex -s user -- codex -m gpt-5-codex -c model_reasoning_effort="high" --enable web_search_request mcp-server
# claude mcp add gpt -s user -- codex -m gpt-5 -c model_reasoning_effort="high" --enable web_search_request mcp-server
# claude mcp add gemini -- npx -y gemini-mcp-tool
# Python environment
echo "Setting up Python virtual environment..."
uv venv --python 3.10 --seed
source .venv/bin/activate
# Clone VERL repository
echo "Cloning VERL repository..."
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
git checkout v0.6.0
# Core dependencies
echo "Installing PyTorch and VERL..."
uv pip install "torch==2.8.*" torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
uv pip install -v -e .
uv pip install wheel
uv pip install packaging
uv pip install -r ./requirements_sglang.txt
# Search/retrieval specific dependencies
echo "Installing retrieval service dependencies..."
uv pip install faiss-gpu-cu12
# issue with uvloop version https://github.com/volcengine/verl/issues/3806
uv pip install uvloop==0.21.0
# Download Wikipedia corpus and FAISS index
echo "Downloading Wikipedia corpus and FAISS index..."
export save_path=~/dataset
mkdir -p $save_path
huggingface-cli download maknee/wiki-18-subsets wiki-18-100k.jsonl.gz --repo-type=dataset --local-dir $save_path
huggingface-cli download maknee/wiki-18-subsets e5_Flat-100k.index --repo-type=dataset --local-dir $save_path
# Move files to expected locations
mv $save_path/wiki-18-100k.jsonl.gz $save_path/wiki-18.jsonl.gz
mv $save_path/e5_Flat-100k.index $save_path/e5_Flat.index
# Decompress the JSONL file
gzip -d $save_path/wiki-18.jsonl.gz -f
# Data preparation
echo "Preparing search R1 dataset..."
python3 examples/data_preprocess/preprocess_search_r1_dataset.py
# sed -i 's/num_workers: *120/num_workers: 8/' examples/sglang_multiturn/config/tool_config/search_tool_config.yaml
# # Setup faiss
# # Activate conda (only in the current shell)
# eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
# # (Optional) Add conda to your default shell startup
# conda init
# # Reload shell config
# source ~/.bashrc
# # Create and activate the retriever environment with Python 3.10
# conda create -n retriever python=3.10 -y
# conda activate retriever
# # Install PyTorch (with GPU support) and related libraries
# conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# # Install other Python packages
# pip install transformers datasets pyserini huggingface_hub
# # Install the GPU version of faiss
# conda install faiss-gpu=1.9.0 -c pytorch -c nvidia -y
# # Install the API service framework
# pip install uvicorn fastapi hf_transfer
# echo "✓ Setup complete!"
# echo "Dataset location: ~/data/searchR1_processed_direct/"
# echo "VERL repository: $(pwd)"
git clone https://github.com/PeterGriffinJin/Search-R1/
run: |
set -e
echo "=== VERL Search Tool Interaction Training ==="
# Multi-node setup
HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
NUM_NODES=$SKYPILOT_NUM_NODES
NUM_GPUS_PER_NODE=$SKYPILOT_NUM_GPUS_PER_NODE
# Network configuration for distributed training
NETWORK_INTERFACE=$(ip route get 8.8.8.8 | grep -oP 'dev \K\S+')
export GLOO_SOCKET_IFNAME=$NETWORK_INTERFACE
export NCCL_SOCKET_IFNAME=$NETWORK_INTERFACE
# PyTorch multiprocessing configuration
export TORCH_MULTIPROCESSING_SHARING_STRATEGY=file_system
# Activate environment
source .venv/bin/activate
# Set up paths
cd verl
PROJECT_DIR="$(pwd)"
export PYTHONPATH="$PROJECT_DIR:$PYTHONPATH"
# Start retrieval service
echo "Starting retrieval server..."
# conda activate retriever
save_path=~/dataset
index_file=$save_path/e5_Flat.index
corpus_file=$save_path/wiki-18.jsonl
retriever_name=e5
retriever_path=intfloat/e5-base-v2
python examples/sglang_multiturn/search_r1_like/local_dense_retriever/retrieval_server.py \
--index_path $index_file \
--corpus_path $corpus_file \
--topk 3 \
--retriever_name $retriever_name \
--retriever_model $retriever_path &
RETRIEVAL_PID=$!
sleep 10
# conda deactivate  # not needed: the conda-based retriever setup above is commented out
# WandB login (optional)
if [ -n "$WANDB_API_KEY" ]; then
echo "Logging into Weights & Biases..."
python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"
fi
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
echo "Starting Ray head node on port 6379..."
ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265
# Wait for all nodes to connect
echo "Waiting for $NUM_NODES nodes to connect..."
retry_count=0
max_retries=30
while [ $retry_count -lt $max_retries ]; do
connected_nodes=$(ray status 2>/dev/null | grep -c "node_" || echo "0")
if [ "$connected_nodes" -ge "$NUM_NODES" ]; then
echo "✓ All $NUM_NODES nodes connected"
break
fi
retry_count=$((retry_count+1))
sleep 10
done
# Display Ray cluster status
echo "Ray cluster status:"
ray status
echo "Starting search tool interaction training..."
cd $PROJECT_DIR
# Increase file descriptor limit
ulimit -n 65535
# Set up configuration paths
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"
TRAIN_DATA="$HOME/data/searchR1_processed_direct/train.parquet"
VAL_DATA="$HOME/data/searchR1_processed_direct/test.parquet"
TOOL_CONFIG="$CONFIG_PATH/tool_config/search_tool_config.yaml"
# Configure logging based on WANDB_API_KEY availability
if [ -n "$WANDB_API_KEY" ]; then
LOGGER_CONFIG='["console","wandb"]'
WANDB_ARGS="trainer.project_name=$WANDB_PROJECT_NAME trainer.experiment_name=$WANDB_EXPERIMENT_NAME"
echo "✓ WandB logging enabled"
else
LOGGER_CONFIG='["console"]'
WANDB_ARGS=""
echo "ℹ WandB logging disabled (no API key provided)"
fi
# Training with search tool
python3 -m verl.trainer.main_ppo \
--config-path="$CONFIG_PATH" \
--config-name='search_multiturn_grpo' \
algorithm.adv_estimator=grpo \
data.train_batch_size=$TRAIN_BATCH_SIZE \
data.val_batch_size=$VAL_BATCH_SIZE \
data.max_prompt_length=4096 \
data.max_response_length=3000 \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.return_raw_chat=True \
actor_rollout_ref.model.path=$MODEL_NAME \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.285 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.rollout.max_model_len=15000 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.multi_turn.max_assistant_turns=2 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.val_before_train=False \
trainer.logger="$LOGGER_CONFIG" \
$WANDB_ARGS \
trainer.n_gpus_per_node=$NUM_GPUS_PER_NODE \
trainer.nnodes=$NUM_NODES \
trainer.save_freq=$SAVE_FREQ \
trainer.test_freq=$TEST_FREQ \
data.train_files="$TRAIN_DATA" \
data.val_files="$VAL_DATA" \
actor_rollout_ref.rollout.multi_turn.tool_config_path="$TOOL_CONFIG" \
trainer.total_epochs=$TOTAL_EPOCHS \
trainer.total_training_steps=$TOTAL_STEPS \
trainer.default_local_dir=/checkpoints
echo "✓ Training complete!"
# Model checkpoint merging
echo "Merging model checkpoints..."
LATEST_STEP=$(cat /checkpoints/latest_checkpointed_iteration.txt)
CHECKPOINT_DIR="/checkpoints/global_step_${LATEST_STEP}/actor"
python -m verl.model_merger merge \
--backend fsdp \
--tie-word-embedding \
--local_dir ${CHECKPOINT_DIR} \
--target_dir /checkpoints/hf_model
echo "✓ Model saved to /checkpoints/hf_model"
echo "Training artifacts saved to cloud bucket: ${CHECKPOINT_BUCKET_NAME}"
# Clean up the retrieval service before finishing
if [ -n "$RETRIEVAL_PID" ]; then
echo "Stopping retrieval service..."
kill $RETRIEVAL_PID 2>/dev/null || true
sleep 5
fi
else
# Worker node setup
echo "Worker node (rank $SKYPILOT_NODE_RANK) connecting to head at $HEAD_IP:6379..."
sleep 15
ps aux | grep ray | grep $HEAD_IP:6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
echo "✓ Worker node connected"
sleep infinity
fi
verl-grpo.yaml
# Usage:
# sky launch -c verl-grpo llm/verl/verl-grpo.yaml --secret WANDB_API_KEY --num-nodes 1 -y
#
# sky launch -c verl-grpo llm/verl/verl-grpo.yaml --secret WANDB_API_KEY --secret HF_TOKEN --num-nodes 1 -y
resources:
accelerators: H100:1
memory: 128+
image_id: docker:verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
ports:
- 8265
- 9090
envs:
TOTAL_EPOCHS: 1
WANDB_PROJECT_NAME: skypilot-verl
WANDB_EXPERIMENT_NAME: grpo-code
CHECKPOINT_BUCKET_NAME: sky-verl-grpo-checkpoints
HF_UPLOAD_MODEL_NAME: "maknee/verl-grpo-code"
SAVE_FINAL_MODEL_HF_PATH: /checkpoints/hf_model
file_mounts:
/checkpoints:
store: nebius
name: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT
/code:
name: code
source: llm/verl/code
mode: COPY
secrets:
HF_TOKEN: null
WANDB_API_KEY: null
setup: |
rm -f ~/.pip/pip.conf
rm -f ~/.config/pip/pip.conf
sudo apt install iproute2 -y
uv venv --python 3.10 --seed
source .venv/bin/activate
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
git checkout 83aebcc133663c12ac33ea3d5ba5c5c5b4687286
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
uv pip install -v -e .
uv pip install hf_transfer
uv pip install flashinfer-python
uv pip install "vllm==0.10.0" --torch-backend=auto
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
uv pip install datasets
uv pip install "ray[train]" "click<8.2.0"
uv pip install tqdm
# Pin uvloop to 0.21.0 to work around asyncio event loop bug
# See: https://github.com/volcengine/verl/issues/3806
uv pip install "uvloop==0.21.0"
echo "Downloading code dataset..."
mkdir -p ~/data/code
python3 /code/preprocess_rstar_coder.py --local_dir ~/data/code
echo "code dataset download completed"
run: |
HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
NUM_NODES=$SKYPILOT_NUM_NODES
NUM_GPUS_PER_NODE=$SKYPILOT_NUM_GPUS_PER_NODE
#NETWORK_INTERFACE=$(ip route get 8.8.8.8 | grep -oP 'src \K\S+')
#export GLOO_SOCKET_IFNAME=$NETWORK_INTERFACE
NETWORK_INTERFACE=$(ip route get 8.8.8.8 | grep -oP 'dev \K\S+')
export GLOO_SOCKET_IFNAME=$NETWORK_INTERFACE
export NCCL_SOCKET_IFNAME=$NETWORK_INTERFACE
export VLLM_USE_V1=1
source .venv/bin/activate
python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"
# This script is only available on skypilot-nightly>=1.0.0.dev20251114
# If you are using an older version, you can copy and paste the script from:
# https://github.com/skypilot-org/skypilot/blob/master/sky_templates/ray/start_cluster
export RAY_DASHBOARD_HOST=0.0.0.0
~/sky_templates/ray/start_cluster
# Head node: wait for workers and run training
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
# Wait for all worker nodes to join
retry_count=0
max_retries=30
while [ $retry_count -lt $max_retries ]; do
connected_nodes=$(ray status 2>/dev/null | grep -c "node_" || echo "0")
echo "Connected nodes: $connected_nodes/$NUM_NODES (attempt $((retry_count+1))/$max_retries)"
if [ "$connected_nodes" -ge "$NUM_NODES" ]; then
echo "All nodes connected to Ray cluster"
break
fi
retry_count=$((retry_count+1))
sleep 10
done
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/code/train.parquet \
data.val_files=$HOME/data/code/test.parquet \
data.train_batch_size=32 \
data.max_prompt_length=256 \
data.max_response_length=256 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=16 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.ppo_epochs=1 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=1 \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.max_num_batched_tokens=2048 \
actor_rollout_ref.rollout.trace.backend=weave \
actor_rollout_ref.rollout.trace.token2text=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=[console,wandb] \
trainer.n_gpus_per_node=$NUM_GPUS_PER_NODE \
trainer.nnodes=$NUM_NODES \
trainer.save_freq=10 \
trainer.test_freq=1 \
trainer.total_epochs=${TOTAL_EPOCHS} \
trainer.default_local_dir=/checkpoints \
trainer.project_name=$WANDB_PROJECT_NAME \
trainer.experiment_name=$WANDB_EXPERIMENT_NAME
LATEST_STEP=$(cat /checkpoints/latest_checkpointed_iteration.txt)
CHECKPOINT_DIR="/checkpoints/global_step_${LATEST_STEP}/actor"
if [ -n "$HF_TOKEN" ]; then
python -m verl.model_merger merge \
--backend fsdp \
--tie-word-embedding \
--local_dir ${CHECKPOINT_DIR} \
--target_dir ${SAVE_FINAL_MODEL_HF_PATH} \
--hf_upload_path ${HF_UPLOAD_MODEL_NAME}
else
python -m verl.model_merger merge \
--backend fsdp \
--tie-word-embedding \
--local_dir ${CHECKPOINT_DIR} \
--target_dir ${SAVE_FINAL_MODEL_HF_PATH}
fi
vllm serve /checkpoints/hf_model \
--host 0.0.0.0 \
--port 9090
fi
verl-ppo.yaml
# Usage:
# sky launch -c verl-ppo llm/verl/verl-ppo.yaml --secret WANDB_API_KEY --num-nodes 1 -y
#
# sky launch -c verl-ppo llm/verl/verl-ppo.yaml --secret WANDB_API_KEY --secret HF_TOKEN --num-nodes 1 -y
resources:
infra: nebius
accelerators: H100:1
memory: 128+
image_id: docker:verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
ports:
- 8265
- 9090
num_nodes: 1
envs:
TOTAL_EPOCHS: 1
WANDB_PROJECT_NAME: skypilot-verl
WANDB_EXPERIMENT_NAME: ppo-math
CHECKPOINT_BUCKET_NAME: sky-verl-ppo-checkpoints
HF_UPLOAD_MODEL_NAME: "maknee/verl-ppo-math"
SAVE_FINAL_MODEL_HF_PATH: /checkpoints/hf_model
file_mounts:
/checkpoints:
store: nebius
name: ${CHECKPOINT_BUCKET_NAME}
mode: MOUNT
secrets:
HF_TOKEN: null
WANDB_API_KEY: null
setup: |
rm -f ~/.pip/pip.conf
rm -f ~/.config/pip/pip.conf
sudo apt install iproute2 -y
uv venv --python 3.10 --seed
source .venv/bin/activate
rm -rf verl
git clone https://github.com/volcengine/verl.git
cd verl
git checkout 83aebcc133663c12ac33ea3d5ba5c5c5b4687286
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
uv pip install -v -e .
uv pip install hf_transfer
uv pip install flashinfer-python
uv pip install "vllm==0.10.0" --torch-backend=auto
uv pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
uv pip install datasets
uv pip install "ray[train]" "click<8.2.0"
uv pip install tqdm
# Pin uvloop to 0.21.0 to work around asyncio event loop bug
# See: https://github.com/volcengine/verl/issues/3806
uv pip install "uvloop==0.21.0"
echo "Downloading Math dataset..."
mkdir -p ~/data/math
python3 "$(pwd)/examples/data_preprocess/math_dataset.py" --local_dir ~/data/math
echo "Math dataset download completed"
uv pip install zmq
run: |
HEAD_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
NUM_NODES=$SKYPILOT_NUM_NODES
NUM_GPUS_PER_NODE=$SKYPILOT_NUM_GPUS_PER_NODE
#NETWORK_INTERFACE=$(ip route get 8.8.8.8 | grep -oP 'src \K\S+')
#export GLOO_SOCKET_IFNAME=$NETWORK_INTERFACE
NETWORK_INTERFACE=$(ip route get 8.8.8.8 | grep -oP 'dev \K\S+')
export GLOO_SOCKET_IFNAME=$NETWORK_INTERFACE
export NCCL_SOCKET_IFNAME=$NETWORK_INTERFACE
export VLLM_USE_V1=1
source .venv/bin/activate
python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')"
# This script is only available on skypilot-nightly>=1.0.0.dev20251114
# If you are using an older version, you can copy and paste the script from:
# https://github.com/skypilot-org/skypilot/blob/master/sky_templates/ray/start_cluster
export RAY_DASHBOARD_HOST=0.0.0.0
~/sky_templates/ray/start_cluster
# Head node: wait for workers and run training
if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
# Wait for all worker nodes to join
retry_count=0
max_retries=30
while [ $retry_count -lt $max_retries ]; do
connected_nodes=$(ray status 2>/dev/null | grep -c "node_" || echo "0")
echo "Connected nodes: $connected_nodes/$NUM_NODES (attempt $((retry_count+1))/$max_retries)"
if [ "$connected_nodes" -ge "$NUM_NODES" ]; then
echo "All nodes connected to Ray cluster"
break
fi
retry_count=$((retry_count+1))
sleep 10
done
python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/math/train.parquet \
data.val_files=$HOME/data/math/test.parquet \
data.train_batch_size=256 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.trace.backend=weave \
actor_rollout_ref.rollout.trace.token2text=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
critic.optim.lr=1e-5 \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.model.fsdp_config.model_dtype=bfloat16 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=[console,wandb] \
trainer.val_before_train=False \
trainer.n_gpus_per_node=$NUM_GPUS_PER_NODE \
trainer.nnodes=$NUM_NODES \
trainer.save_freq=10 \
trainer.test_freq=1 \
trainer.default_local_dir=/checkpoints \
trainer.total_epochs=${TOTAL_EPOCHS} \
trainer.project_name=$WANDB_PROJECT_NAME \
trainer.experiment_name=$WANDB_EXPERIMENT_NAME
LATEST_STEP=$(cat /checkpoints/latest_checkpointed_iteration.txt)
CHECKPOINT_DIR="/checkpoints/global_step_${LATEST_STEP}/actor"
if [ -n "$HF_TOKEN" ]; then
python -m verl.model_merger merge \
--backend fsdp \
--tie-word-embedding \
--local_dir ${CHECKPOINT_DIR} \
--target_dir ${SAVE_FINAL_MODEL_HF_PATH} \
--hf_upload_path ${HF_UPLOAD_MODEL_NAME}
else
python -m verl.model_merger merge \
--backend fsdp \
--tie-word-embedding \
--local_dir ${CHECKPOINT_DIR} \
--target_dir ${SAVE_FINAL_MODEL_HF_PATH}
fi
vllm serve /checkpoints/hf_model \
--host 0.0.0.0 \
--port 9090
else
sleep 15
echo "Starting Ray worker node..."
ps aux | grep ray | grep $HEAD_IP:6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
sleep 10
fi
echo "Node setup and Ray start script finished for rank $SKYPILOT_NODE_RANK."