Source: llm/yi
Running Yi with SkyPilot on Your Cloud#
🤖 The Yi series models are the next generation of open-source large language models trained from scratch by 01.AI.
Update (Sep 19, 2024) - SkyPilot now supports the Yi models (Yi-Coder, Yi-1.5)!
Why use SkyPilot to deploy over commercial hosted solutions?#
Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and multiple regions/clouds.
Pay the absolute minimum: SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No managed solution markups.
Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint (a minimal SkyServe sketch follows this list).
Everything stays in your Kubernetes or cloud account (your VMs & buckets).
Completely private - no one else sees your chat history.
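For the multi-replica serving mentioned above, SkyServe (part of SkyPilot) handles deployment and load balancing. A minimal sketch, assuming you first append a service: section (e.g. a readiness probe on /v1/models and a replica count; see the SkyServe docs for the exact fields) to one of the YAMLs below, such as yi15-34b.yaml:
sky serve up -n yi yi15-34b.yaml   # deploy replicas behind a single endpoint
sky serve status yi                # check replica status and the service endpoint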
Running Yi model with SkyPilot#
After installing SkyPilot, run your own Yi model on vLLM with a single command:
Start serving Yi-1.5 34B on a single instance, using any available GPU from the list specified in yi15-34b.yaml, behind a vLLM-powered OpenAI-compatible endpoint (you can also switch to yicoder-9b.yaml or another of the included YAMLs for a smaller model):
sky launch -c yi yi15-34b.yaml
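The first launch can take a while (provisioning plus model download). To follow the setup and serving logs and see when vLLM is ready, you can run:
sky logs yi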
Send a request to the endpoint for completion:
ENDPOINT=$(sky status --endpoint 8000 yi)
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"prompt": "Who are you?",
"max_tokens": 512
}' | jq -r '.choices[0].text'
Send a request for chat completion:
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "01-ai/Yi-1.5-34B-Chat",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
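When you are done, shut the cluster down so the GPUs stop billing, or set an autostop timer instead:
sky down yi
# or: stop the cluster automatically after 10 minutes of idleness
sky autostop -i 10 yi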
Included files#
yi15-34b.yaml
envs:
  MODEL_NAME: 01-ai/Yi-1.5-34B-Chat

resources:
  # Any one of these GPU configurations will be accepted.
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_size: 1024  # GB, to hold the model weights.
  disk_tier: best
  memory: 32+  # At least 32 GB of host RAM.
  ports: 8000  # Expose the OpenAI-compatible API.

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
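Since MODEL_NAME is defined under envs in the file above, you can point the same YAML at a different Yi checkpoint at launch time with --env; for example, to serve the 9B chat model instead (any compatible Hugging Face model id from this page works):
sky launch -c yi yi15-34b.yaml --env MODEL_NAME=01-ai/Yi-1.5-9B-Chat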
yi15-6b.yaml
envs:
  MODEL_NAME: 01-ai/Yi-1.5-6B-Chat

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
yi15-9b.yaml
envs:
  MODEL_NAME: 01-ai/Yi-1.5-9B-Chat

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
yicoder-1_5b.yaml
envs:
  MODEL_NAME: 01-ai/Yi-Coder-1.5B-Chat

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
yicoder-9b.yaml
envs:
  MODEL_NAME: 01-ai/Yi-Coder-9B-Chat

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_tier: best
  ports: 8000

setup: |
  pip install vllm==0.6.1.post2
  pip install vllm-flash-attn

run: |
  export PATH=$PATH:/sbin
  vllm serve $MODEL_NAME \
    --host 0.0.0.0 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
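The Yi-Coder YAMLs are used the same way. For example, to serve the 9B coder model and ask it for code:
sky launch -c yicoder yicoder-9b.yaml
ENDPOINT=$(sky status --endpoint 8000 yicoder)
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "01-ai/Yi-Coder-9B-Chat",
        "messages": [
          {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}
        ],
        "max_tokens": 512
      }' | jq -r '.choices[0].message.content'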