Source: llm/gpt-oss-finetuning
Finetuning OpenAI gpt-oss Models with SkyPilot#

On August 5, 2025, OpenAI released gpt-oss, including two state-of-the-art open-weight language models: gpt-oss-120b and gpt-oss-20b. These models deliver strong real-world performance at low cost and are available under the flexible Apache 2.0 license.
The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while the gpt-oss-20b model delivers similar results to OpenAI o3-mini.
This guide walks through how to finetune both models, with either LoRA or full finetuning, using 🤗 Accelerate.
If you're looking to run inference on gpt-oss models, check out the inference example.

Step 0: Setup infrastructure#
SkyPilot is a framework for running AI and batch workloads on any infrastructure, offering unified execution, high cost savings, and high GPU availability.
Install SkyPilot#
pip install 'skypilot[all]'
For more details on how to set up your cloud credentials, see the SkyPilot docs.
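For example, if your workloads will run on AWS (one option among many; adapt this to your own provider), the standard AWS CLI credential setup is all SkyPilot needs:
# Example: configure AWS credentials locally (adjust for your provider)
aws configure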
Choose your infrastructure#
Run sky check to see which clouds and Kubernetes clusters your credentials enable:
sky check
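Optionally, you can also query SkyPilot's catalog to see which of your enabled infrastructure offers the GPUs used in this guide:
# List offerings for the GPU types and counts used below
sky show-gpus H100:8
sky show-gpus H200:8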
Configure checkpoint storage (Optional)#
Checkpoint storage is optional and only needed if you want to resume training from interruptions. By default, checkpoints are saved locally on the cluster.
To enable checkpoint persistence across cluster restarts, uncomment and configure the S3 bucket in the YAML files:
file_mounts:
  /checkpoints:
    source: s3://my-skypilot-bucket # change this to your bucket
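The YAML above mounts an existing bucket; if yours does not exist yet, create it first (this sketch assumes AWS S3 and the AWS CLI, and reuses the placeholder bucket name from the snippet):
# Create the checkpoint bucket once, before launching
aws s3 mb s3://my-skypilot-bucket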
Step 1: Run gpt-oss finetuning#
Full finetuning#
For gpt-oss-20b (smaller model):
Requirements: 1 node, 8x H100 GPUs
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml
For gpt-oss-120b (larger model):
Requirements: 4 nodes, 8x H200 GPUs each
sky launch -c gpt-oss-120b-sft gpt-oss-120b-sft.yaml
# gpt-oss-120b-sft.yaml
resources:
  accelerators: H200:8
  network_tier: best

file_mounts:
  /sft: ./sft
  /checkpoints:
    source: s3://my-skypilot-bucket # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-120b-sft
  WANDB_RESUME: allow
  WANDB_API_KEY: "" # optionally, enable WandB tracking by providing the API key

num_nodes: 4

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb
  uv pip install nvitop

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2_120b.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-120b --resume_from_checkpoint
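Because checkpoints land in the mounted S3 bucket and training is started with --resume_from_checkpoint, an interrupted run can simply be relaunched: re-running sky launch against the same cluster name executes setup and run again, and training picks up from the latest checkpoint (a sketch of the recovery flow, assuming the bucket mount above is enabled):
# Relaunch after an interruption; training resumes from /checkpoints
sky launch -c gpt-oss-120b-sft gpt-oss-120b-sft.yaml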
LoRA finetuning#
For gpt-oss-20b with LoRA:
Requirements: 1 node, 2x H100 GPUs
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml
For gpt-oss-120b with LoRA:
Requirements: 1 node, 8x H100 GPUs
sky launch -c gpt-oss-120b-lora gpt-oss-120b-lora.yaml
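To avoid paying for an idle cluster after training completes, you can optionally let SkyPilot stop or tear down the cluster automatically; the flags below are standard sky launch options:
# Autodown the cluster after 30 idle minutes
sky launch -c gpt-oss-20b-lora gpt-oss-20b-lora.yaml -i 30 --down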
Step 2: Monitor and get results#
Once your finetuning job is running, you can monitor the progress and retrieve results:
# Check job status
sky status
# View logs
sky logs <cluster-name>
# Terminate the cluster when you are done
sky down <cluster-name>
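If you trained without the S3 checkpoint mount, the checkpoints live on the cluster's disk. SkyPilot adds each cluster to your SSH config under its name, so you can copy them off with standard tools before terminating (a sketch; the path follows the default --output_dir in train.py):
# Copy checkpoints from the cluster to your machine, then tear it down
rsync -avz gpt-oss-20b-sft:/checkpoints/ ./checkpoints/
sky down gpt-oss-20b-sft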
Optional: WandB tracking#
To enable experiment tracking with Weights & Biases, set your API key in the YAML configuration:
envs:
  WANDB_API_KEY: "your-wandb-api-key"
Each training run will automatically use a unique run ID based on the SkyPilot task ID for easy tracking and resuming.
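If you prefer not to store the key in the YAML, you can forward it from your local environment at launch time with --env (when no value is given, SkyPilot reads it from your shell):
export WANDB_API_KEY="your-wandb-api-key"
sky launch -c gpt-oss-20b-sft gpt-oss-20b-sft.yaml --env WANDB_API_KEY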
Example full finetuning progress#
Here's what you can expect to see during training: the loss should decrease and the mean token accuracy should improve over time.
gpt-oss-20b training progress#
Training Progress for gpt-oss-20b on Nebius:
6%|█ | 1/16 [01:18<19:31, 78.12s/it]
{'loss': 2.2344, 'grad_norm': 17.139, 'learning_rate': 0.0, 'num_tokens': 51486.0, 'mean_token_accuracy': 0.5436, 'epoch': 0.06}
12%|██ | 2/16 [01:23<08:10, 35.06s/it]
{'loss': 2.1689, 'grad_norm': 16.724, 'learning_rate': 0.0002, 'num_tokens': 105023.0, 'mean_token_accuracy': 0.5596, 'epoch': 0.12}
25%|███ | 4/16 [01:34<03:03, 15.26s/it]
{'loss': 2.1548, 'grad_norm': 3.983, 'learning_rate': 0.000192, 'num_tokens': 214557.0, 'mean_token_accuracy': 0.5182, 'epoch': 0.25}
50%|█████ | 8/16 [01:56<00:59, 7.43s/it]
{'loss': 2.1323, 'grad_norm': 3.460, 'learning_rate': 0.000138, 'num_tokens': 428975.0, 'mean_token_accuracy': 0.5432, 'epoch': 0.5}
75%|████████ | 12/16 [02:15<00:21, 5.50s/it]
{'loss': 1.4624, 'grad_norm': 0.888, 'learning_rate': 6.5e-05, 'num_tokens': 641021.0, 'mean_token_accuracy': 0.6522, 'epoch': 0.75}
100%|██████████| 16/16 [02:34<00:00, 4.88s/it]
{'loss': 1.1294, 'grad_norm': 0.713, 'learning_rate': 2.2e-05, 'num_tokens': 852192.0, 'mean_token_accuracy': 0.7088, 'epoch': 1.0}
Final Training Summary:
{'train_runtime': 298.36, 'train_samples_per_second': 3.352, 'train_steps_per_second': 0.054, 'train_loss': 2.086, 'epoch': 1.0}
✓ Job finished (status: SUCCEEDED).
Memory and GPU utilization using nvitop
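To watch GPU memory and utilization yourself, SSH into the cluster by name and run nvitop from the training virtualenv (nvitop is installed in the setup step above):
ssh gpt-oss-20b-sft
source ~/training/bin/activate && nvitop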

gpt-oss-120b training progress#
Training Progress for gpt-oss-120b on 4 nodes:
3%|█ | 1/32 [03:45<116:23, 225.28s/it]
6%|█ | 2/32 [06:12<90:21, 181.05s/it]
9%|█ | 3/32 [08:45<71:22, 147.67s/it]
12%|██ | 4/32 [11:18<59:44, 128.01s/it]
25%|███ | 8/32 [22:36<67:48, 169.50s/it]
44%|█████ | 14/32 [29:03<43:37, 145.41s/it]
Memory and GPU utilization using nvitop

Configuration files#
You can find the complete configurations and the training script below; they live in the llm/gpt-oss-finetuning directory of the SkyPilot repository.
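To run the files as-is, you can pull them from the SkyPilot repository (the path is assumed from the source reference at the top of this page):
git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/gpt-oss-finetuning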
Included files#
gpt-oss-120b-lora.yaml
resources:
  accelerators: H100:8
  disk_size: 1024
  network_tier: best

file_mounts:
  /sft: ./sft
  # Uncomment to enable checkpoint persistence across cluster restarts by saving them to S3
  # /checkpoints:
  #   source: s3://my-skypilot-bucket # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-120b-lora
  WANDB_RESUME: allow
  WANDB_API_KEY: "" # optionally, enable WandB tracking by providing the API key

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb
  uv pip install nvitop

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python /sft/train.py --model_id openai/gpt-oss-120b --enable_lora --resume_from_checkpoint
gpt-oss-120b-sft.yaml
resources:
  accelerators: H200:8
  disk_size: 1024
  network_tier: best

file_mounts:
  /sft: ./sft
  # Uncomment to enable checkpoint persistence across cluster restarts by saving them to S3
  # /checkpoints:
  #   source: s3://my-skypilot-bucket # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-120b-sft
  WANDB_RESUME: allow
  WANDB_API_KEY: "" # optionally, enable WandB tracking by providing the API key

num_nodes: 4

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb
  uv pip install nvitop

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch --config_file /sft/fsdp2_120b.yaml --num_machines $SKYPILOT_NUM_NODES --num_processes $NP --machine_rank $SKYPILOT_NODE_RANK --main_process_ip $MASTER_ADDR --main_process_port 29500 /sft/train.py --model_id openai/gpt-oss-120b --resume_from_checkpoint
gpt-oss-20b-lora.yaml
resources:
  accelerators: H100:2
  disk_size: 512
  network_tier: best

file_mounts:
  /sft: ./sft
  # Uncomment to enable checkpoint persistence across cluster restarts by saving them to S3
  # /checkpoints:
  #   source: s3://my-skypilot-bucket # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-20b-lora
  WANDB_RESUME: allow
  WANDB_API_KEY: "" # optionally, enable WandB tracking by providing the API key

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb
  uv pip install nvitop

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python /sft/train.py --model_id openai/gpt-oss-20b --enable_lora --resume_from_checkpoint
gpt-oss-20b-sft.yaml
name: gpt-oss-20b-sft-finetuning
resources:
  accelerators: H100:8
  disk_size: 512
  network_tier: best

file_mounts:
  /sft: ./sft
  # Uncomment to enable checkpoint persistence across cluster restarts by saving them to S3
  # /checkpoints:
  #   source: s3://my-skypilot-bucket # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-20b-sft
  WANDB_RESUME: allow
  WANDB_API_KEY: "" # optionally, enable WandB tracking by providing the API key

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb
  uv pip install nvitop

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch --config_file /sft/fsdp2.yaml --num_machines $SKYPILOT_NUM_NODES --num_processes $NP --machine_rank $SKYPILOT_NODE_RANK --main_process_ip $MASTER_ADDR --main_process_port 29500 /sft/train.py --model_id openai/gpt-oss-20b --resume_from_checkpoint
sft/fsdp2.yaml
# Requires accelerate 1.7.0 or higher
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_transformer_layer_cls_to_wrap: GptOssDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: false
  fsdp_use_orig_params: true
  # fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_forward_prefetch: true
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
sft/fsdp2_120b.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_transformer_layer_cls_to_wrap: GptOssDecoderLayer
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_use_orig_params: false
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_forward_prefetch: false
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
sft/train.py
import argparse
import os
from accelerate import Accelerator
from accelerate import ProfileKwargs
from datasets import load_dataset
from peft import get_peft_model
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import Mxfp4Config
from trl import SFTConfig
from trl import SFTTrainer


class ProfilingSFTTrainer(SFTTrainer):

    def __init__(self, *args, accelerator_profiler=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.accelerator_profiler = accelerator_profiler

    def training_step(self, *args, **kwargs):
        result = super().training_step(*args, **kwargs)
        if self.accelerator_profiler is not None:
            # Advance the accelerate profiler schedule once per training step.
            self.accelerator_profiler.step()
        return result

    def train(self, resume_from_checkpoint=None, *args, **kwargs):
        if resume_from_checkpoint or (self.args.resume_from_checkpoint and
                                      os.path.exists(self.args.output_dir)):
            checkpoint_path = resume_from_checkpoint
            if not checkpoint_path:
                # Find the latest checkpoint in the output directory.
                checkpoint_dirs = [
                    d for d in os.listdir(self.args.output_dir)
                    if d.startswith("checkpoint-") and
                    os.path.isdir(os.path.join(self.args.output_dir, d))
                ]
                if checkpoint_dirs:
                    checkpoint_dirs.sort(key=lambda x: int(x.split("-")[1]))
                    checkpoint_path = os.path.join(self.args.output_dir,
                                                   checkpoint_dirs[-1])
            if checkpoint_path:
                print(f"Resuming from checkpoint: {checkpoint_path}")
                # Hand the discovered checkpoint to the underlying trainer.
                resume_from_checkpoint = checkpoint_path
        return super().train(resume_from_checkpoint=resume_from_checkpoint,
                             *args,
                             **kwargs)


def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(
        description="Train a model using SFT on the Multilingual-Thinking dataset")
    parser.add_argument(
        "--model_id",
        type=str,
        default="openai/gpt-oss-120b",
        help="The model ID to use for training (default: openai/gpt-oss-120b)")
    parser.add_argument("--enable_lora",
                        action="store_true",
                        default=False,
                        help="Enable LoRA")
    parser.add_argument(
        "--enable_profiling",
        action="store_true",
        default=False,
        help="Enable accelerate profiling with chrome trace export")
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of gradient accumulation steps (default: 1)")
    parser.add_argument("--per_device_train_batch_size",
                        type=int,
                        default=1,
                        help="Training batch size per device (default: 1)")
    parser.add_argument(
        "--resume_from_checkpoint",
        action="store_true",
        default=False,
        help="Enable resuming from the latest checkpoint (default: False)")
    parser.add_argument(
        "--output_dir",
        type=str,
        default="/checkpoints",
        help="Directory to save checkpoints (default: /checkpoints)")
    args = parser.parse_args()

    # Setup profiling if enabled
    accelerator_kwargs = {}
    if args.enable_profiling:

        def trace_handler(p):
            p.export_chrome_trace(f"/tmp/trace_{p.step_num}.json")

        profile_kwargs = ProfileKwargs(activities=["cpu", "cuda"],
                                       schedule_option={
                                           "wait": 1,
                                           "warmup": 1,
                                           "active": 1,
                                           "repeat": 0,
                                           "skip_first": 1,
                                       },
                                       on_trace_ready=trace_handler)
        accelerator_kwargs['kwargs_handlers'] = [profile_kwargs]

    accelerator = Accelerator(**accelerator_kwargs)
    model_id = args.model_id

    # Load dataset
    num_proc = int(os.cpu_count() / 2)
    train_dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking",
                                 split="train",
                                 num_proc=num_proc)

    quantization_config = Mxfp4Config(dequantize=True)

    device_map_args = {}
    if args.enable_lora:
        # For LoRA runs, shard the frozen base model across the visible GPUs.
        device_map_args = {'device_map': 'auto'}

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation="eager",
        torch_dtype="auto",
        use_cache=False,
        quantization_config=quantization_config,
        **device_map_args,
    )
    print(f'Loaded model: {args.model_id}')

    if args.enable_lora:
        num_layers = 0
        target_parameters = []
        if args.model_id == 'openai/gpt-oss-120b':
            num_layers = 36
        elif args.model_id == 'openai/gpt-oss-20b':
            num_layers = 24
        # Target the MoE expert projections in every decoder layer.
        for i in range(num_layers):
            target_parameters.append(f'{i}.mlp.experts.gate_up_proj')
            target_parameters.append(f'{i}.mlp.experts.down_proj')
        peft_config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules="all-linear",
            target_parameters=target_parameters,
        )
        model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()

    report_to = "wandb" if os.environ.get("WANDB_API_KEY") else "none"

    # Setup output directory for checkpoints
    output_dir = os.path.join(args.output_dir, model_id.replace('/', '-'))
    os.makedirs(output_dir, exist_ok=True)

    # Train model
    training_args = SFTConfig(
        output_dir=output_dir,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=1,
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        max_length=1024,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine_with_min_lr",
        lr_scheduler_kwargs={"min_lr_rate": 0.1},
        dataset_num_proc=num_proc,
        # Disable gradient_checkpointing as we use FSDP activation_checkpointing.
        gradient_checkpointing=False,
        report_to=report_to,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        resume_from_checkpoint=args.resume_from_checkpoint,
    )

    # Train model with optional profiling
    trainer_kwargs = {
        'args': training_args,
        'model': model,
        'train_dataset': train_dataset,
    }
    if args.enable_profiling:
        with accelerator.profile() as prof:
            trainer_kwargs['accelerator_profiler'] = prof
            trainer = ProfilingSFTTrainer(**trainer_kwargs)
            trainer.train()
    else:
        trainer = ProfilingSFTTrainer(**trainer_kwargs)
        trainer.train()


if __name__ == "__main__":
    main()
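Beyond the SkyPilot YAMLs, train.py can also be invoked directly on an already-provisioned node, for example to profile a few steps: the --enable_profiling flag wraps training in accelerate's profiler and writes Chrome traces to /tmp/trace_<step>.json. A sketch, assuming the environment from the setup section is activated and the file mounts above are in place:
# Profile a short single-node run and export Chrome traces
source ~/training/bin/activate
accelerate launch --config_file /sft/fsdp2.yaml --num_processes 8 \
  /sft/train.py --model_id openai/gpt-oss-20b --enable_profiling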