Source: llm/gemma
Gemma: Open-source Gemini#
Google's release of Gemma has made a big wave in the AI community. It opens up the opportunity for the open-source community to serve and finetune a private "Gemini".
Serve Gemma on any Cloud#
Serving Gemma on any cloud is easy with SkyPilot. With the serve.yaml in this directory, you can host the model on any cloud with a single command.
Prerequisites#
Apply for access to the Gemma model
Go to the application page and click Acknowledge license to apply for access to the model weights.
Get the access token from Hugging Face
Generate a read-only access token on Hugging Face here, and make sure your Hugging Face account can access the Gemma models here.
Install SkyPilot
pip install "skypilot-nightly[all]"
For detailed installation instructions, please refer to the installation guide.
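Before launching, it can help to confirm that SkyPilot has at least one cloud enabled, and to export your Hugging Face token so it can be passed to the commands below (the "xxx" value is a placeholder for your own token):
# Verify that at least one cloud is enabled for SkyPilot.
sky check
# Optionally export the token once instead of prefixing every command.
export HF_TOKEN="xxx"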
Host on a Single Instance#
We can host the model on a single instance:
HF_TOKEN="xxx" sky launch -c gemma serve.yaml --env HF_TOKEN
After the cluster is launched, we can access the model with the following command:
IP=$(sky status --ip gemma)
curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }' | jq .
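The response follows the OpenAI completions format; if you only want the generated text, you can extract it from the first choice with jq (a small convenience, assuming jq is installed):
curl -s http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }' | jq -r '.choices[0].text'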
Chat API is also supported:
IP=$(sky status --ip gemma)
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "messages": [
      {
        "role": "user",
        "content": "Hello! What is your name?"
      }
    ],
    "max_tokens": 25
  }'
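Once you are done experimenting with the single instance, you can stop or tear down the cluster to avoid idle charges:
# Stop the cluster (keeps the disk so it can be restarted later).
sky stop gemma
# Or tear it down completely.
sky down gemma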
Scale the Serving with SkyServe#
Using the same YAML, we can easily scale the model serving across multiple instances, regions and clouds with SkyServe:
HF_TOKEN="xxx" sky serve up -n gemma serve.yaml --env HF_TOKEN
Notice that the only change is from `sky launch` to `sky serve up`. The same YAML can be used without changes.
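You can check the status of the service and its replicas while they are being provisioned; the second command below streams the logs of a single replica (replica 1 here, following SkyServe's service-name/replica-id convention):
# Show the status of the gemma service and its replicas.
sky serve status gemma
# Stream the logs of replica 1.
sky serve logs gemma 1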
After the service is up, we can access the model with the following command:
ENDPOINT=$(sky serve status --endpoint gemma)
curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }' | jq .
Chat API is also supported:
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "messages": [
      {
        "role": "user",
        "content": "Hello! What is your name?"
      }
    ],
    "max_tokens": 25
  }'
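To shut down the service and all of its replicas when you are finished:
sky serve down gemma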
Included files#
serve.yaml
# An example YAML for serving the Gemma model from Google with an OpenAI-compatible API.
# Usage:
#  1. Launch on a single instance: `sky launch -c gemma ./serve.yaml`
#  2. Scale up to multiple instances with a single endpoint:
#     `sky serve up -n gemma ./serve.yaml`

service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  replicas: 2

envs:
  MODEL_NAME: google/gemma-7b-it
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  ports: 8000
  disk_tier: best

setup: |
  conda activate gemma
  if [ $? -ne 0 ]; then
    conda create -n gemma -y python=3.10
    conda activate gemma
  fi
  pip install vllm==0.3.2
  pip install transformers==4.38.1
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate gemma
  export PATH=$PATH:/sbin
  # --max-model-len is set to 1024 to avoid taking too much GPU memory on L4 and
  # A10g with small memory.
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
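The YAML serves google/gemma-7b-it by default. Because MODEL_NAME is declared under envs, the same file can point at a different Gemma variant without editing it; for example (a sketch, assuming your Hugging Face account also has access to the 2B instruction-tuned weights):
# Serve the 2B instruction-tuned model instead of the default 7B one.
HF_TOKEN="xxx" sky launch -c gemma serve.yaml \
  --env HF_TOKEN \
  --env MODEL_NAME=google/gemma-2b-it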