Source: llm/kimi-k2-thinking
Kimi K2 Thinking#
Kimi K2 Thinking is an advanced large language model created by Moonshot AI.
This recipe shows how to run Kimi K2 Thinking with reasoning capabilities on Kubernetes or any cloud. It includes two modes:

- Low Latency (TP8): Best for interactive applications requiring quick responses
- High Throughput (TP8+DCP8): Best for batch processing and high-volume serving scenarios
Prerequisites#
- Check that you have installed SkyPilot (docs).
- Check that sky check shows clouds or Kubernetes is enabled.

Note: This model requires 8x H200 or H20 GPUs.
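If SkyPilot is not set up yet, the commands below are a minimal sketch of one way to install it and confirm that the required GPUs are reachable. The pip extra and the GPU query are assumptions; pick the extras that match your infra.

```bash
# Install SkyPilot with the Kubernetes extra (swap in [aws], [gcp], etc. as needed).
pip install -U "skypilot[kubernetes]"

# Verify that at least one cloud or Kubernetes context is enabled.
sky check

# Optionally check where 8x H200 GPUs are available across your enabled infra.
sky show-gpus H200:8
```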
Run Kimi K2 Thinking (Low Latency Mode)#
For low-latency scenarios, use tensor parallelism:
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
kimi-k2-thinking.sky.yaml uses tensor parallelism across 8 GPUs for optimal low-latency performance.
🎉 Congratulations! 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.
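While the cluster provisions and vLLM downloads the model weights, you can follow progress with standard SkyPilot commands (a small sketch; the cluster name is the one passed to -c above):

```bash
# Stream the setup and serving logs of the launched cluster.
sky logs kimi-k2-thinking

# Show the cluster state and confirm it is UP.
sky status kimi-k2-thinking
```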
Run Kimi K2 Thinking (High Throughput Mode)#
For high-throughput scenarios, use Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput:
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
The kimi-k2-thinking-high-throughput.sky.yaml adds --decode-context-parallel-size 8 to enable DCP:
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
DCP Performance Gains#
From vLLM’s benchmark:
| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|---|---|---|---|
| Request Throughput (req/s) | 1.25 | 1.57 | +25.6% |
| Output Token Throughput (tok/s) | 485.78 | 695.13 | +43.1% |
| Mean TTFT (sec) | 271.2 | 227.8 | +16.0% |
| KV Cache Size (tokens) | 715,072 | 5,721,088 | 8x |
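To sanity-check these numbers on your own deployment, one option is vLLM's serving benchmark. The sketch below is an assumption-laden example: the vllm bench serve entry point and flag names follow recent vLLM releases (older versions use benchmarks/benchmark_serving.py), they may differ in your image, and the request mix is arbitrary.

```bash
# Point the benchmark at the high-throughput cluster's endpoint.
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)

# Synthetic load: 200 requests with ~2K input tokens and ~512 output tokens each.
vllm bench serve \
  --backend vllm \
  --base-url http://$ENDPOINT \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 200
```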
Chat with Kimi K2 Thinking via the OpenAI API#
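Before sending a chat request, you can optionally confirm the server is ready by listing the served models; /v1/models is part of the OpenAI-compatible API that vLLM exposes.

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

# The endpoint should report moonshotai/Kimi-K2-Thinking once vLLM is ready.
curl http://$ENDPOINT/v1/models | jq .
```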
To curl /v1/chat/completions:
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
The model will provide its reasoning process in the response, showing its chain-of-thought approach.
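Because the server is launched with --reasoning-parser kimi_k2, vLLM typically returns the chain-of-thought in a separate reasoning_content field on the message. The jq filter below is a sketch that assumes this field name; adjust it if your vLLM version structures the response differently.

```bash
# Split the response into the model's reasoning trace and its final answer.
curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }' \
  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'
```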
Clean up resources#
To shut down all resources:
sky down kimi-k2-thinking
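If you also launched the high-throughput cluster, shut it down under its own name:

```bash
sky down kimi-k2-thinking-ht
```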
Serving Kimi-K2-Thinking: scaling up with SkyServe#
With no change to the YAML, launch a fully managed service with autoscaling replicas and load-balancing on your infra:
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
Wait until the service is ready:
watch -n10 sky serve status kimi-k2-thinking
Get a single endpoint that load-balances across replicas:
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
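To inspect an individual replica while the service runs, SkyServe exposes per-replica logs (replica IDs start at 1):

```bash
# Tail the logs of replica 1 of the managed service.
sky serve logs kimi-k2-thinking 1
```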
To curl the endpoint:
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
To shut down all resources:
sky serve down kimi-k2-thinking
See more details in SkyServe docs.
Included files#
kimi-k2-thinking-high-throughput.sky.yaml
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (High Throughput Mode).
# Uses Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput.
#
# Usage:
# sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
# sky serve up kimi-k2-thinking-high-throughput.sky.yaml -n kimi-k2-thinking-ht
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10
kimi-k2-thinking.sky.yaml
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (Low Latency Mode).
# This model supports deep thinking & tool orchestration with reasoning capabilities.
#
# Usage:
# sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
# sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (Low Latency Mode)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10