Source: llm/llama-2
Llama 2: Open LLM from Meta#
Llama-2 is the top open-source model on the Open LLM leaderboard today. It has been released under a license that permits commercial use. With SkyPilot, you can deploy a private Llama-2 chatbot in your own cloud with a single command.
Why use SkyPilot to deploy instead of commercial hosted solutions?#
No lock-in: run on any supported cloud - AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI
Everything stays in your cloud account (your VMs & buckets)
No one else sees your chat history
Pay absolute minimum — no managed solution markups
Freely choose your own model size, GPU type, number of GPUs, etc., based on your scale and budget.
…and you get all of this with 1 click — let SkyPilot automate the infra.
Pre-requisites#
Apply for access to the Llama-2 model
Go to the application page and apply for access to the model weights.
Get an access token from Hugging Face
Generate a read-only access token on Hugging Face here, and make sure your Hugging Face account has been granted access to the Llama-2 models here.
Fill the access token into the chatbot-hf.yaml and chatbot-meta.yaml files.
envs:
  MODEL_SIZE: 7
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
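Alternatively, you can keep the token out of the YAML files and pass it on the command line with --env at launch time (the token value below is a placeholder for your own token):
sky launch -c llama-serve -s chatbot-hf.yaml --env HF_TOKEN=<your-huggingface-token>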
Running your own Llama-2 chatbot with SkyPilot#
You can now host your own Llama-2 chatbot with SkyPilot using a single command.
Start serving the Llama-2-7B-Chat model on a single A100 GPU:
sky launch -c llama-serve -s chatbot-hf.yaml
Check the output of the command. There will be a shareable Gradio link (like the last line below). Open it in your browser to chat with Llama-2.
(task, pid=20933) 2023-04-12 22:08:49 | INFO | gradio_web_server | Namespace(host='0.0.0.0', port=None, controller_url='http://localhost:21001', concurrency_count=10, model_list_mode='once', share=True, moderate=False)
(task, pid=20933) 2023-04-12 22:08:49 | INFO | stdout | Running on local URL: http://0.0.0.0:7860
(task, pid=20933) 2023-04-12 22:08:51 | INFO | stdout | Running on public URL: https://<random-hash>.gradio.live
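You can re-stream the serving logs at any time, and tear down the cluster when you are done to stop incurring charges (the cluster name matches the launch command above):
sky logs llama-serve
sky down llama-serve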
Optional: Try other GPUs:
sky launch -c llama-serve-l4 -s chatbot-hf.yaml --gpus L4
L4 is the latest-generation GPU built for large-scale AI inference workloads. Find more details here.
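To see which clouds offer L4 GPUs and at what price, you can query SkyPilot's catalog (output varies with the clouds you have enabled):
sky show-gpus L4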
Optional: Serve the 13B model instead of the default 7B:
sky launch -c llama-serve -s chatbot-hf.yaml --env MODEL_SIZE=13
Optional: Serve the 70B Llama-2 model:
sky launch -c llama-serve-70b -s chatbot-hf.yaml --env MODEL_SIZE=70 --gpus A100-80GB:2
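As a rough estimate (not an official requirement): the 70B model's weights alone take about 70 billion parameters × 2 bytes in fp16 ≈ 140 GB of GPU memory, which is why two A100-80GB GPUs are requested here, while the 7B and 13B models fit on a single A100.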
How to run the Llama-2 chatbot with the FAIR model?#
You can also host the official FAIR-released model without using Hugging Face or Gradio.
Launch the Llama-2 chatbot on the cloud:
sky launch -c llama chatbot-meta.yaml
Open another terminal and run:
ssh -L 7681:localhost:7681 llama
Open http://localhost:7681 in your browser and start chatting!
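(The run step in chatbot-meta.yaml serves an interactive chat terminal with ttyd on port 7681, which the ssh -L command above forwards to your local machine.)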
Included files#
chatbot-hf.yaml
resources:
  accelerators: A100:1
  disk_size: 1024
  disk_tier: best
  memory: 32+

envs:
  MODEL_SIZE: 7
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

setup: |
  conda activate chatbot
  if [ $? -ne 0 ]; then
    conda create -n chatbot python=3.9 -y
    conda activate chatbot
  fi

  # Install dependencies
  pip install "fschat[model_worker,webui]"
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate chatbot

  echo 'Starting controller...'
  python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 &
  sleep 10
  echo 'Starting model worker...'
  python -u -m fastchat.serve.model_worker \
    --model-path meta-llama/Llama-2-${MODEL_SIZE}b-chat-hf \
    --num-gpus $SKYPILOT_NUM_GPUS_PER_NODE \
    --host 127.0.0.1 2>&1 \
    | tee model_worker.log &

  echo 'Waiting for model worker to start...'
  while ! grep -q 'Uvicorn running on' model_worker.log; do sleep 1; done

  echo 'Starting gradio server...'
  python -u -m fastchat.serve.gradio_web_server --share | tee ~/gradio.log
chatbot-meta.yaml
resources:
  memory: 32+
  accelerators: A100:1
  disk_size: 1024
  disk_tier: best

envs:
  MODEL_SIZE: 7
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

setup: |
  set -ex

  git clone https://github.com/facebookresearch/llama.git || true
  cd ./llama
  pip install -e .
  cd -

  git clone https://github.com/skypilot-org/sky-llama.git || true
  cd sky-llama
  pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  pip install -r requirements.txt
  pip install -e .
  cd -

  # Download the model weights from the huggingface hub, as the official
  # download script has some problems.
  git config --global credential.helper cache
  sudo apt -y install git-lfs
  pip install transformers
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}', add_to_git_credential=True)"
  git clone https://huggingface.co/meta-llama/Llama-2-${MODEL_SIZE}b-chat

  # ttyd serves an interactive terminal over HTTP (port 7681 by default).
  wget https://github.com/tsl0922/ttyd/releases/download/1.7.2/ttyd.x86_64
  sudo mv ttyd.x86_64 /usr/local/bin/ttyd
  sudo chmod +x /usr/local/bin/ttyd

run: |
  cd sky-llama
  ttyd /bin/bash -c "torchrun --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE chat.py --ckpt_dir ~/sky_workdir/Llama-2-${MODEL_SIZE}b-chat --tokenizer_path ~/sky_workdir/Llama-2-${MODEL_SIZE}b-chat/tokenizer.model"