Source: llm/gemma
Gemma: Open-source Gemini#
Google's release of Gemma has made a big wave in the AI community. It opens up the opportunity for the open-source community to serve and finetune a private "Gemini".
Serve Gemma on any Cloud#
Serving Gemma on any cloud is easy with SkyPilot. With the serve.yaml in this directory, you can host the model on any cloud with a single command.
Prerequisites#
Apply for access to the Gemma model
Go to the application page and click Acknowledge license to apply for access to the model weights.
Get the access token from Hugging Face
Generate a read-only access token on Hugging Face here, and make sure your Hugging Face account can access the Gemma models here.
Install SkyPilot
pip install "skypilot-nightly[all]"
For detailed installation instructions, please refer to the installation guide.
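Before launching, it can help to confirm that SkyPilot has at least one cloud enabled, and to export your Hugging Face token so it can be passed to the commands below (the "xxx" value is a placeholder for your own token):
# Verify that at least one cloud is enabled for SkyPilot.
sky check
# Optionally export the token once instead of prefixing every command.
export HF_TOKEN="xxx"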
Host on a Single Instance#
We can host the model on a single instance:
HF_TOKEN="xxx" sky launch -c gemma serve.yaml --env HF_TOKEN
After the cluster is launched, we can access the model with the following command:
IP=$(sky status --ip gemma)
curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }' | jq .
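The response follows the OpenAI completions format; if you only want the generated text, you can extract it from the first choice with jq (a small convenience, assuming jq is installed):
curl -s http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }' | jq -r '.choices[0].text'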
Chat API is also supported:
IP=$(sky status --ip gemma)
curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "messages": [
      {
        "role": "user",
        "content": "Hello! What is your name?"
      }
    ],
    "max_tokens": 25
  }'
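Once you are done experimenting with the single instance, you can stop or tear down the cluster to avoid idle charges:
# Stop the cluster (keeps the disk so it can be restarted later).
sky stop gemma
# Or tear it down completely.
sky down gemma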
Scale the Serving with SkyServe#
Using the same YAML, we can easily scale the model serving across multiple instances, regions and clouds with SkyServe:
HF_TOKEN="xxx" sky serve up -n gemma serve.yaml --env HF_TOKEN
Notice that the only change is from `sky launch` to `sky serve up`. The same YAML can be used without changes.
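You can check the status of the service and its replicas while they are being provisioned; the second command below streams the logs of a single replica (replica 1 here, following SkyServe's service-name/replica-id convention):
# Show the status of the gemma service and its replicas.
sky serve status gemma
# Stream the logs of replica 1.
sky serve logs gemma 1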
After the service is up, we can access the model with the following command:
ENDPOINT=$(sky serve status --endpoint gemma)
curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "prompt": "My favourite condiment is",
    "max_tokens": 25
  }' | jq .
Chat API is also supported:
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-7b-it",
    "messages": [
      {
        "role": "user",
        "content": "Hello! What is your name?"
      }
    ],
    "max_tokens": 25
  }'
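To shut down the service and all of its replicas when you are finished:
sky serve down gemma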
Included files#
serve.yaml
# An example YAML for serving the Gemma model from Google with an OpenAI-compatible API.
# Usage:
#  1. Launch on a single instance: `sky launch -c gemma ./serve.yaml`
#  2. Scale up to multiple instances with a single endpoint:
#     `sky serve up -n gemma ./serve.yaml`

service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  replicas: 2

envs:
  MODEL_NAME: google/gemma-7b-it
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  ports: 8000
  disk_tier: best

setup: |
  conda activate gemma
  if [ $? -ne 0 ]; then
    conda create -n gemma -y python=3.10
    conda activate gemma
  fi
  pip install vllm==0.3.2
  pip install transformers==4.38.1
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate gemma
  export PATH=$PATH:/sbin
  # --max-model-len is set to 1024 to avoid taking too much GPU memory on L4 and
  # A10g with small memory.
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
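The YAML serves google/gemma-7b-it by default. Because MODEL_NAME is declared under envs, the same file can point at a different Gemma variant without editing it; for example (a sketch, assuming your Hugging Face account also has access to the 2B instruction-tuned weights):
# Serve the 2B instruction-tuned model instead of the default 7B one.
HF_TOKEN="xxx" sky launch -c gemma serve.yaml \
  --env HF_TOKEN \
  --env MODEL_NAME=google/gemma-2b-it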