Source: llm/kimi-k2
# Kimi-K2
Kimi-K2 is a large language model developed by Moonshot AI. It is a mixture-of-experts model with 1 trillion total parameters, of which 32 billion are activated per token; its substantial size means it cannot fit on a single 8-GPU node and must be served across multiple nodes.
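As a rough back-of-envelope check (our estimate, not an official figure): at about one byte per parameter (e.g. an FP8 checkpoint), 1 trillion parameters correspond to roughly 1 TB of weights, while 16 H100s provide 16 × 80 GB = 1,280 GB of GPU memory, leaving headroom for the KV cache. This is why a single 8-GPU node is not enough.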
## Prerequisites
- Check that you have installed SkyPilot (docs).
- Check that `sky check` shows clouds or Kubernetes are enabled.

Note: this model requires at least 16 H100s due to its large size.
## Run Kimi-K2
```bash
HF_TOKEN=xxx sky launch kimi-k2.sky.yaml -c kimi-k2 --secret HF_TOKEN
```
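Provisioning two 8x H100 nodes and downloading the model weights can take a while; you can follow progress by streaming the cluster's logs:

```bash
# Tail the setup and run logs of the launched cluster
sky logs kimi-k2
```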
The `kimi-k2.sky.yaml` file is as follows:
```yaml
envs:
  MODEL_NAME: moonshotai/Kimi-K2-Instruct

secrets:
  HF_TOKEN: null # Pass with `--secret HF_TOKEN` in CLI

resources:
  image_id: docker:vllm/vllm-openai:v0.10.0
  network_tier: best
  accelerators: H100:8
  cpus: 100+
  memory: 1000+
  ports: 8081

# Uses multi-node serving on H100s - needs at least 16 H100s in total.
num_nodes: 2

setup: |
  pip install blobfile

run: |
  echo "Starting Ray..."
  sudo chmod 777 -R /var/tmp
  HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
    sleep 5
  else
    sleep 5
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
    # Add a sleep after `ray start` to give Ray enough time to daemonize.
    sleep 5
  fi
  sleep 10
  echo "Ray cluster started"
  ray status

  echo 'Starting vLLM API server...'
  # Set VLLM_HOST_IP to the IP of the current node based on its rank.
  VLLM_HOST_IP=`echo "$SKYPILOT_NODE_IPS" | sed -n "$((SKYPILOT_NODE_RANK + 1))p"`
  export VLLM_HOST_IP
  # Only the head node needs to start the vLLM API server.
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    vllm serve $MODEL_NAME \
      --port 8081 \
      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
      --pipeline-parallel-size $SKYPILOT_NUM_NODES \
      --max-model-len 32768 \
      --trust-remote-code
  else
    sleep infinity
  fi

service:
  replicas: 1
  # An actual request for the readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
```
Due to Kimi-K2’s large size (1 trillion total parameters with 32 billion activated parameters), this configuration uses multi-node serving with vLLM:

- **Pipeline parallelism**: the model is distributed across 2 nodes using `--pipeline-parallel-size`.
- **Tensor parallelism**: each node uses 8 H100 GPUs with `--tensor-parallel-size`.
- **Ray cluster**: coordinates the multi-node setup for distributed serving.
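Concretely, with `num_nodes: 2` and `accelerators: H100:8`, SkyPilot exposes `SKYPILOT_NUM_NODES=2` and `SKYPILOT_NUM_GPUS_PER_NODE=8` to the run script, so the head node effectively launches the command below (a sketch of the expanded invocation, assuming those values):

```bash
# Expanded `vllm serve` invocation on the head node (rank 0):
# - tensor parallelism shards each layer across the 8 GPUs within a node
# - pipeline parallelism splits the layer stack across the 2 nodes (16 GPUs total)
vllm serve moonshotai/Kimi-K2-Instruct \
  --port 8081 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --max-model-len 32768 \
  --trust-remote-code
```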
🎉 Congratulations! 🎉 You have now launched the Kimi-K2 LLM on your infra with multi-node serving.
## Chat with Kimi-K2 with OpenAI API
To curl `/v1/chat/completions`:
```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq .
```
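If the request fails, a quick sanity check is to list the models the server is advertising; vLLM's OpenAI-compatible server also exposes `/v1/models` (uses the same `ENDPOINT` as above):

```bash
# List the models served by the endpoint; should include moonshotai/Kimi-K2-Instruct
curl http://$ENDPOINT/v1/models | jq .
```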
To stop the instance:

```bash
sky stop kimi-k2
```

To shut down all resources:

```bash
sky down kimi-k2
```
## Serving Kimi-K2: scaling up with SkyServe
With no change to the YAML, launch a fully managed service on your infra:

```bash
HF_TOKEN=xxx sky serve up kimi-k2.sky.yaml -n kimi-k2 --secret HF_TOKEN
```

Wait until the service is ready:

```bash
watch -n10 sky serve status kimi-k2
```

Get a single endpoint that load-balances across replicas:

```bash
ENDPOINT=$(sky serve status --endpoint kimi-k2)
```
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
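To scale beyond a single replica later, you can raise `replicas` under the `service` section of the YAML and roll out the change in place. A sketch, assuming `sky serve update` accepts the same `--secret` flag as `sky serve up`:

```bash
# After editing `service.replicas` in kimi-k2.sky.yaml (e.g. to 2),
# apply the new spec to the running service:
HF_TOKEN=xxx sky serve update kimi-k2 kimi-k2.sky.yaml --secret HF_TOKEN
```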
To curl the endpoint:
```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq .
```
To shut down all resources:

```bash
sky serve down kimi-k2
```
See more details in SkyServe docs.
## Included files
### kimi-k2.sky.yaml
```yaml
# Serve Kimi-K2 with SkyPilot and vLLM.
# Uses multi-node serving with tensor parallelism and pipeline parallelism.
#
# Usage:
#   HF_TOKEN=xxx sky launch kimi-k2.sky.yaml -c kimi-k2 --secret HF_TOKEN
#   HF_TOKEN=xxx sky serve up kimi-k2.sky.yaml -n kimi-k2 --secret HF_TOKEN

envs:
  MODEL_NAME: moonshotai/Kimi-K2-Instruct

secrets:
  HF_TOKEN: null # Pass with `--secret HF_TOKEN` in CLI

resources:
  image_id: docker:vllm/vllm-openai:v0.10.0
  network_tier: best
  accelerators: H100:8
  cpus: 100+
  memory: 1000+
  ports: 8081

# Uses multi-node serving on H100s - needs at least 16 H100s in total.
num_nodes: 2

setup: |
  pip install blobfile

run: |
  echo "Starting Ray..."
  sudo chmod 777 -R /var/tmp
  HEAD_IP=`echo "$SKYPILOT_NODE_IPS" | head -n1`
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats --port 6379
    sleep 5
  else
    sleep 5
    ps aux | grep ray | grep 6379 &> /dev/null || ray start --address $HEAD_IP:6379 --disable-usage-stats
    # Add a sleep after `ray start` to give Ray enough time to daemonize.
    sleep 5
  fi
  sleep 10
  echo "Ray cluster started"
  ray status

  echo 'Starting vLLM API server...'
  # Set VLLM_HOST_IP to the IP of the current node based on its rank.
  VLLM_HOST_IP=`echo "$SKYPILOT_NODE_IPS" | sed -n "$((SKYPILOT_NODE_RANK + 1))p"`
  export VLLM_HOST_IP
  # Only the head node needs to start the vLLM API server.
  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    vllm serve $MODEL_NAME \
      --port 8081 \
      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
      --pipeline-parallel-size $SKYPILOT_NUM_NODES \
      --max-model-len 32768 \
      --trust-remote-code
  else
    sleep infinity
  fi

service:
  replicas: 1
  # An actual request for the readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
```