Run and Serve OpenAI gpt-oss Models with SkyPilot and vLLM#
On August 5, 2025, OpenAI released gpt-oss, including two state-of-the-art open-weight language models: gpt-oss-120b and gpt-oss-20b. These models deliver strong real-world performance at low cost and are available under the flexible Apache 2.0 license. The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while the gpt-oss-20b model delivers similar results to OpenAI o3-mini.
This guide walks through how to run and host gpt-oss models on any infrastructure using SkyPilot and vLLM, from local GPU workstations to Kubernetes clusters and public clouds (16+ clouds supported).
If you’re looking to finetune gpt-oss models, check out the finetuning example.
Step 0: Set up infrastructure#
SkyPilot is a framework for running AI and batch workloads on any infrastructure, offering unified execution, high cost savings, and high GPU availability.
Install SkyPilot#
pip install 'skypilot[all]'
For more details on how to set up your cloud credentials, see the SkyPilot docs.
Choose your infrastructure#
Run sky check to verify which clouds and Kubernetes clusters your credentials give you access to:
sky check
Step 1: Run gpt-oss models#
Basic deployment#
For gpt-oss-20b (smaller model):
sky launch -c gpt-oss-20b gpt-oss-vllm.sky.yaml \
  --env MODEL_NAME=openai/gpt-oss-20b
For gpt-oss-120b (larger model):
sky launch -c gpt-oss-120b gpt-oss-vllm.sky.yaml \
  --env MODEL_NAME=openai/gpt-oss-120b
Step 2: Get results#
Query your deployment#
Get the endpoint:
ENDPOINT=$(sky status --endpoint 8000 gpt-oss-20b)
# or for 120b model:
# ENDPOINT=$(sky status --endpoint 8000 gpt-oss-120b)
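Before sending chat requests, you can optionally confirm the server is ready. vLLM's OpenAI-compatible server exposes a /v1/models route; here is a minimal Python check (assuming the requests package is installed and ENDPOINT is exported as above):
import os
import requests

# List the models served at the endpoint; a 200 response means vLLM is up.
endpoint = os.environ["ENDPOINT"]
resp = requests.get(f"http://{endpoint}/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])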
Test with cURL#
Basic completion:
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain what MXFP4 quantization is."
      }
    ]
  }' | jq .
Use with OpenAI SDK#
import os
from openai import OpenAI

ENDPOINT = os.getenv("ENDPOINT")

client = OpenAI(
    base_url=f"http://{ENDPOINT}/v1",
    api_key="EMPTY",
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."},
    ],
)
print(result.choices[0].message.content)
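The endpoint also supports streaming responses. A short sketch, reusing the client object from the snippet above:
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize MXFP4 quantization in one sentence."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta of the assistant's reply.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()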
Step 3: Scale with SkyServe#
For production workloads, use SkyServe for automatic scaling and load balancing:
sky serve up -n gpt-oss-service gpt-oss-vllm.sky.yaml \
  --env MODEL_NAME=openai/gpt-oss-20b -y
Check service status:
sky serve status
Get service endpoint:
ENDPOINT=$(sky serve status --endpoint gpt-oss-service)
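The service endpoint is a drop-in replacement for the single-replica endpoint above: point the same OpenAI client at it, and SkyServe load-balances requests across replicas. For example:
import os
from openai import OpenAI

# Same client setup as in Step 2, now targeting the SkyServe endpoint.
client = OpenAI(base_url=f"http://{os.environ['ENDPOINT']}/v1", api_key="EMPTY")
reply = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello from SkyServe!"}],
)
print(reply.choices[0].message.content)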
Custom configuration#
The YAML configuration supports various customizations:
- Reasoning effort: The models support low, medium, and high reasoning effort, set in the system prompt, e.g., “Reasoning: high” (see the sketch after this list).
- Context length: Up to 128k tokens natively supported.
- Memory optimization: MXFP4 quantization reduces memory requirements.
- Tool use: Built-in support for function calling and web browsing.
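For example, to request high reasoning effort, reusing the OpenAI client from Step 2 (the system-prompt convention is the one noted in the list above):
result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        # The reasoning level is set in the system prompt, per the note above.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the sum of two even numbers is even."},
    ],
)
print(result.choices[0].message.content)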
Integration with other tools#
The deployed endpoint is OpenAI-compatible, so it works with:
- LangChain: For building complex AI applications (see the sketch after this list)
- OpenAI Agents SDK: For agentic workflows
- llm CLI tool: For command-line interactions
- Any OpenAI-compatible client: Drop-in replacement
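As an illustration of the LangChain item, a minimal sketch (assuming the langchain-openai package is installed; the model name and endpoint follow the examples above):
import os
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI-compatible chat model at the self-hosted endpoint.
llm = ChatOpenAI(
    model="openai/gpt-oss-20b",
    base_url=f"http://{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
)
print(llm.invoke("What is MXFP4 quantization?").content)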
Configuration file#
You can find the complete configuration in gpt-oss-vllm.sky.yaml.
Cleanup#
To shut down your deployment:
# For basic deployments
sky down gpt-oss-20b
sky down gpt-oss-120b
# For SkyServe deployments
sky serve down gpt-oss-service