Source: llm/qwen

Serving Qwen3/Qwen2 on Your Own Kubernetes or Cloud#

Qwen2 is one of the top open LLMs. As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the LMSYS Chatbot Arena Leaderboard.

Update (Apr 28, 2025) - SkyPilot now supports the Qwen3 model!

📰 Update (Sep 18, 2024) - SkyPilot now supports the Qwen2.5 model!

📰 Update (Jun 6, 2024) - SkyPilot now also supports the Qwen2 model! It further improves the competitive model, Qwen1.5.

📰 Update (April 26, 2024) - SkyPilot now also supports the Qwen1.5-110B model! It performs competitively with Llama-3-70B across a series of evaluations. Use qwen15-110b.yaml to serve the 110B model.

One command to start a Qwen3#

sky launch -c qwen qwen3-235b.yaml

qwen

References#

Qwen docs

Why use SkyPilot to deploy over commercial hosted solutions?#

Get the best GPU availability by utilizing multiple resources pools across Kubernetes clusters and multiple regions/clouds.
Pay absolute minimum — SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed solution markups.
Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint
Everything stays in your Kubernetes or cloud account (your VMs & buckets)
Completely private - no one else sees your chat history

Running your own Qwen with SkyPilot#

After installing SkyPilot, run your own Qwen model on vLLM with SkyPilot in 1-click:

Start serving Qwen 110B on a single instance with any available GPU in the list specified in qwen15-110b.yaml with a vLLM powered OpenAI-compatible endpoint (You can also switch to qwen25-72b.yaml or qwen25-7b.yaml for a smaller model):

sky launch -c qwen qwen3-235b.yaml

Send a request to the endpoint for completion:

ENDPOINT=$(sky status --endpoint 8000 qwen)

curl http://$ENDPOINT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-235B-A22B-FP8",
      "prompt": "My favorite food is",
      "max_tokens": 512
  }' | jq -r '.choices[0].text'

Send a request for chat completion:

curl http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-235B-A22B-FP8",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful and honest chat expert."
        },
        {
          "role": "user",
          "content": "What is the best food?"
        }
      ],
      "max_tokens": 512
  }' | jq -r '.choices[0].message.content'

Qwen3 output

The concept of "the best food" is highly subjective and depends on personal preferences, cultural background, dietary needs, and even mood! For example:

- **Some crave comfort foods** like macaroni and cheese, ramen, or dumplings.  
- **Others prioritize health** and might highlight dishes like quinoa bowls, grilled salmon, or fresh salads.  
- **Global favorites** often include pizza, sushi, tacos, or curry.  
- **Unique or adventurous eaters** might argue for dishes like insects, fermented foods, or molecular gastronomy creations.  

Could you clarify what you mean by "best"? For instance:  
- Are you asking about taste, health benefits, cultural significance, or something else?  
- Are you looking for a specific dish, ingredient, or cuisine?  

This helps me tailor a more meaningful answer! 😊

Running Multimodal Qwen2-VL#

Start serving Qwen2-VL:

sky launch -c qwen2-vl qwen2-vl-7b.yaml

Send a multimodalrequest to the endpoint for completion:

ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "Covert this logo to ASCII art"},
                {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
            ]
        }],
        "max_tokens": 1024
    }' | jq .

Scale up the service with SkyServe#

With SkyPilot Serving, a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:

sky serve up -n qwen ./qwen25-72b.yaml

This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.

A single endpoint will be returned and any request sent to the endpoint will be routed to the ready replicas.

To check the status of the service, run:

sky serve status qwen

After a while, you will see the following output:

Services
NAME  VERSION  UPTIME  STATUS        REPLICAS  ENDPOINT            
Qwen  1        -       READY         2/2       3.85.107.228:30002  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED    RESOURCES                  STATUS REGION  
Qwen          1   1        -         2 mins ago  1x Azure({'A100-80GB': 8}) READY  eastus  
Qwen          2   1        -         2 mins ago  1x GCP({'L4': 8})          READY  us-east4-a 

As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type is chosen to be the cheapest available one on the clouds. That said, it maximizes the availability of the service while minimizing the cost.

To access the model, we use a curl command to send the request to the endpoint:

ENDPOINT=$(sky serve status --endpoint qwen)

curl http://$ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen2.5-72B-Instruct",
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful and honest code assistant expert in Python."
        },
        {
          "role": "user",
          "content": "Show me the python code for quick sorting a list of integers."
        }
      ],
      "max_tokens": 512
  }' | jq -r '.choices[0].message.content'

Optional: Accessing Qwen with Chat GUI#

It is also possible to access the Qwen service with a GUI using vLLM.

Start the chat web UI (change the --env flag to the model you are running):

sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2.5-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen)

Then, we can access the GUI at the returned gradio link:

| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live

Included files#

gui.yaml

# Starts a GUI server that connects to the Qwen OpenAI API server.
#
# Refer to llm/qwen/README.md for more details.
#
# Usage:
#
#  1. If you have a endpoint started on a cluster (sky launch):
#     `sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky status --ip qwen):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called qwen:
#     `sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen.
  MODEL_NAME: Qwen/Qwen1.5-72B-Chat

resources:
  cpus: 2

setup: |
  conda activate qwen
  if [ $? -ne 0 ]; then
    conda create -n qwen python=3.10 -y
    conda activate qwen
  fi

  # Install Gradio for web UI.
  pip install gradio openai

run: |
  conda activate qwen
  export PATH=$PATH:/sbin
  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 | tee ~/gradio.log

qwen15-110b.yaml

envs:
  MODEL_NAME: Qwen/Qwen1.5-110B-Chat

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2
  

resources:
  accelerators: {A100:8, A100-80GB:4, A100-80GB:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

qwen2-vl-7b.yaml

envs:
  MODEL_NAME: Qwen/Qwen2-VL-7B-Instruct

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2
  

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  # Install later transformers version for the support of
  # qwen2_vl support
  pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 2048 | tee ~/openai_api_server.log

qwen25-72b.yaml

envs:
  MODEL_NAME: Qwen/Qwen2.5-72B-Instruct

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2
  

resources:
  accelerators: {A100:8, A100-80GB:4, A100-80GB:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

qwen25-7b.yaml

envs:
  MODEL_NAME: Qwen/Qwen2.5-7B-Instruct

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2
  

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log

qwen3-235b.yaml

envs:
  MODEL_NAME: Qwen/Qwen3-235B-A22B-FP8

service:
  # Specifying the path to the endpoint to check the readiness of the replicas.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  # How many replicas to manage.
  replicas: 2
  

resources:
  accelerators: {A100:8, A100-80GB:4, A100-80GB:8, H100:8, H200:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  uv pip install "sglang>=0.4.6"

run: |
  export PATH=$PATH:/sbin
  export SGL_ENABLE_JIT_DEEPGEMM=1
  # --tp 4 is required even with 8 GPUs, as the output size
  # of qwen3 is not divisible by quantization block_n=128
  python3 -m sglang.launch_server --model $MODEL_NAME \
    --tp 4 --reasoning-parser qwen3 --port 8000 --host 0.0.0.0