Source: llm/kimi-k2-thinking
Kimi K2 Thinking#
Kimi K2 Thinking is an advanced large language model created by Moonshot AI.
This recipe shows how to run Kimi K2 Thinking with reasoning capabilities on Kubernetes or any cloud. It includes two modes:

- Low Latency (TP8): Best for interactive applications requiring quick responses
- High Throughput (TP8+DCP8): Best for batch processing and high-volume serving scenarios
Prerequisites#
- Check that you have installed SkyPilot (docs).
- Check that sky check shows clouds or Kubernetes is enabled.

Note: This model requires 8x H200 or H20 GPUs.
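If SkyPilot is not set up yet, the commands below are a minimal sketch of one way to install it and confirm that the required GPUs are reachable. The pip extra and the GPU query are assumptions; pick the extras that match your infra.

```bash
# Install SkyPilot with the Kubernetes extra (swap in [aws], [gcp], etc. as needed).
pip install -U "skypilot[kubernetes]"

# Verify that at least one cloud or Kubernetes context is enabled.
sky check

# Optionally check where 8x H200 GPUs are available across your enabled infra.
sky show-gpus H200:8
```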
Run Kimi K2 Thinking (Low Latency Mode)#
For low-latency scenarios, use tensor parallelism:
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
kimi-k2-thinking.sky.yaml uses tensor parallelism across 8 GPUs for optimal low-latency performance.
🎉 Congratulations! 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.
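While the cluster provisions and vLLM downloads the model weights, you can follow progress with standard SkyPilot commands (a small sketch; the cluster name is the one passed to -c above):

```bash
# Stream the setup and serving logs of the launched cluster.
sky logs kimi-k2-thinking

# Show the cluster state and confirm it is UP.
sky status kimi-k2-thinking
```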
Run Kimi K2 Thinking (High Throughput Mode)#
For high-throughput scenarios, use Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput:
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
The kimi-k2-thinking-high-throughput.sky.yaml adds --decode-context-parallel-size 8 to enable DCP:
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
DCP Performance Gains#
From vLLM’s benchmark:
| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|---|---|---|---|
| Request Throughput (req/s) | 1.25 | 1.57 | +25.6% |
| Output Token Throughput (tok/s) | 485.78 | 695.13 | +43.1% |
| Mean TTFT (sec) | 271.2 | 227.8 | +16.0% |
| KV Cache Size (tokens) | 715,072 | 5,721,088 | 8x |
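To sanity-check these numbers on your own deployment, one option is vLLM's serving benchmark. The sketch below is an assumption-laden example: the vllm bench serve entry point and flag names follow recent vLLM releases (older versions use benchmarks/benchmark_serving.py), they may differ in your image, and the request mix is arbitrary.

```bash
# Point the benchmark at the high-throughput cluster's endpoint.
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)

# Synthetic load: 200 requests with ~2K input tokens and ~512 output tokens each.
vllm bench serve \
  --backend vllm \
  --base-url http://$ENDPOINT \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 200
```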
Chat with Kimi K2 Thinking via the OpenAI API#
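Before sending a chat request, you can optionally confirm the server is ready by listing the served models; /v1/models is part of the OpenAI-compatible API that vLLM exposes.

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

# The endpoint should report moonshotai/Kimi-K2-Thinking once vLLM is ready.
curl http://$ENDPOINT/v1/models | jq .
```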
To curl /v1/chat/completions:
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
The model will provide its reasoning process in the response, showing its chain-of-thought approach.
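Because the server is launched with --reasoning-parser kimi_k2, vLLM typically returns the chain-of-thought in a separate reasoning_content field on the message. The jq filter below is a sketch that assumes this field name; adjust it if your vLLM version structures the response differently.

```bash
# Split the response into the model's reasoning trace and its final answer.
curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }' \
  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'
```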
Clean up resources#
To shut down all resources:
sky down kimi-k2-thinking
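If you also launched the high-throughput cluster, shut it down under its own name:

```bash
sky down kimi-k2-thinking-ht
```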
Serving Kimi-K2-Thinking: scaling up with SkyServe#
With no change to the YAML, launch a fully managed service with autoscaling replicas and load-balancing on your infra:
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
Wait until the service is ready:
watch -n10 sky serve status kimi-k2-thinking
Get a single endpoint that load-balances across replicas:
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
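To inspect an individual replica while the service runs, SkyServe exposes per-replica logs (replica IDs start at 1):

```bash
# Tail the logs of replica 1 of the managed service.
sky serve logs kimi-k2-thinking 1
```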
To curl the endpoint:
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
To shut down all resources:
sky serve down kimi-k2-thinking
See more details in SkyServe docs.
Included files#
kimi-k2-thinking-high-throughput.sky.yaml
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (High Throughput Mode).
# Uses Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput.
#
# Usage:
# sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
# sky serve up kimi-k2-thinking-high-throughput.sky.yaml -n kimi-k2-thinking-ht
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10
kimi-k2-thinking.sky.yaml
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (Low Latency Mode).
# This model supports deep thinking & tool orchestration with reasoning capabilities.
#
# Usage:
# sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
# sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (Low Latency Mode)...'
  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10