Vision Llama 3.2 (Meta)#

The Llama 3.2 family was released by Meta on Sep 25, 2024. It includes not only the latest improved (and smaller) LLMs for chat, but also multimodal vision-language models. Let’s point and launch it with SkyPilot.

Why use SkyPilot?#

  • Point, launch, and serve: simply point to the cloud/Kubernetes cluster you have access to, and launch the model there with a single command.

  • No lock-in: run on any supported cloud (AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI).

  • Everything stays in your cloud account (your VMs & buckets).

  • No one else sees your chat history.

  • Pay the absolute minimum: no managed-solution markups.

  • Freely choose your own model size, GPU type, number of GPUs, etc., based on scale and budget.

…and you get all of this with 1 click — let SkyPilot automate the infra.

Prerequisites#

  • Request access to the Llama 3.2 models on Hugging Face (e.g., meta-llama/Llama-3.2-3B-Instruct and meta-llama/Llama-3.2-11B-Vision-Instruct), and have an access token ready; it is passed as HF_TOKEN below.

  • Install SkyPilot and enable at least one cloud or a Kubernetes cluster; sky check should show at least one enabled infra (see the sketch after this list).
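
A minimal setup might look like the following (the pip extras below are examples; install the ones that match your accounts):

# Install SkyPilot together with the providers you plan to use.
pip install -U "skypilot[aws,gcp,kubernetes]"

# Verify that at least one cloud or Kubernetes cluster is enabled.
sky check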

SkyPilot YAML#

The full recipe YAML:

envs:
  MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
  # MODEL_NAME: meta-llama/Llama-3.2-11B-Vision-Instruct
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

resources:
  accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for the smaller Llama 3.2 models.
  cpus: 8+
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  # Install huggingface transformers (pinned commit) for Llama 3.2 support
  pip install git+https://github.com/huggingface/transformers.git@f0eabf6c7da2afbe8425546c092fa3722f9f219e
  pip install vllm==0.6.2

run: |
  echo 'Starting vllm api server...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 4096 \
    2>&1

You can also get the full YAML file here.

Point and Launch Llama 3.2#

Launch a single instance to serve Llama 3.2 on your infra:

$ HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
...
------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------------
 Kubernetes   4CPU--16GB--1L4                4       16        L4:1           kubernetes      0.00          ✔
 RunPod       1x_L4_SECURE                   4       24        L4:1           CA              0.44
 GCP          g2-standard-4                  4       16        L4:1           us-east4-a      0.70
 AWS          g6.xlarge                      4       16        L4:1           us-east-1       0.80
 AWS          g5.xlarge                      4       16        A10G:1         us-east-1       1.01
 RunPod       1x_L40_SECURE                  16      48        L40:1          CA              1.14
 Fluidstack   L40_48GB::1                    32      60        L40:1          CANADA          1.15
 AWS          g6e.xlarge                     4       32        L40S:1         us-east-1       1.86
 Cudo         sapphire-rapids-h100_1x4v8gb   4       8         H100:1         ca-montreal-3   2.86
 Fluidstack   H100_PCIE_80GB::1              28      180       H100:1         CANADA          2.89
 Azure        Standard_NV36ads_A10_v5        36      440       A10:1          eastus          3.20
 GCP          a2-highgpu-1g                  12      85        A100:1         us-central1-a   3.67
 RunPod       1x_H100_SECURE                 16      80        H100:1         CA              4.49
 Azure        Standard_NC40ads_H100_v5       40      320       H100:1         eastus          6.98
------------------------------------------------------------------------------------------------------------------

Wait until the model is ready (this can take 10+ minutes).
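
sky launch streams the setup and serving logs by default; from another terminal you can also check the cluster and re-attach to the logs:

# Show the status of the cluster.
sky status llama3_2

# Tail the logs of the latest job (the vLLM server) on the cluster.
sky logs llama3_2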

🎉 Congratulations! 🎉 You have now launched the Llama 3.2 Instruct LLM on your infra.

Chat with Llama 3.2 via the OpenAI API#

To curl /v1/chat/completions:

ENDPOINT=$(sky status --endpoint 8081 llama3_2)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq .

Example outputs:

{
  "id": "chat-e7b6d2a2d2934bcab169f82812601baf",
  "object": "chat.completion",
  "created": 1727291780,
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm an artificial intelligence model known as Llama. Llama stands for \"Large Language Model Meta AI.\"",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "total_tokens": 68,
    "completion_tokens": 23
  },
  "prompt_logprobs": null
}
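
The endpoint is OpenAI-compatible, so standard request options work as well. For example, a small sketch of a streaming request against the same endpoint (vLLM returns server-sent events, ending with a [DONE] message):

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    "max_tokens": 64,
    "stream": true
  }'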

To stop the instance:

sky stop llama3_2
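
To pick things up again later, you can restart the stopped cluster and re-run the serving job (a sketch, reusing the cluster and YAML names from above):

# Restart the stopped cluster; its disk, including the downloaded checkpoint, is preserved.
sky start llama3_2

# Re-launch the serving job on the restarted cluster.
HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN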

To shut down all resources:

sky down llama3_2

Point and Launch Vision Llama 3.2#

Let’s launch a vision Llama now! The multimodal capability of Llama 3.2 could open up many new use cases. We will go with the 11B vision model here.

$ HF_TOKEN=xxx sky launch llama3_2-vision-11b.yaml -c llama3_2-vision --env HF_TOKEN
------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE                       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------------
 Kubernetes   2CPU--8GB--1H100               2       8         H100:1         kubernetes      0.00          ✔
 RunPod       1x_L40_SECURE                  16      48        L40:1          CA              1.14
 Fluidstack   L40_48GB::1                    32      60        L40:1          CANADA          1.15
 AWS          g6e.xlarge                     4       32        L40S:1         us-east-1       1.86
 RunPod       1x_A100-80GB_SECURE            8       80        A100-80GB:1    CA              1.99
 Cudo         sapphire-rapids-h100_1x2v4gb   2       4         H100:1         ca-montreal-3   2.83
 Fluidstack   H100_PCIE_80GB::1              28      180       H100:1         CANADA          2.89
 GCP          a2-highgpu-1g                  12      85        A100:1         us-central1-a   3.67
 Azure        Standard_NC24ads_A100_v4       24      220       A100-80GB:1    eastus          3.67
 RunPod       1x_H100_SECURE                 16      80        H100:1         CA              4.49
 GCP          a2-ultragpu-1g                 12      170       A100-80GB:1    us-central1-a   5.03
 Azure        Standard_NC40ads_H100_v5       40      320       H100:1         eastus          6.98
------------------------------------------------------------------------------------------------------------------

Chat with Vision Llama 3.2#

ENDPOINT=$(sky status --endpoint 8081 llama3_2-vision)

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "Turn this logo into ASCII art."},
                {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
            ]
        }],
        "max_tokens": 1024
    }' | jq .

Example output (parsed):

Output 1:

-------------
-        -
-   -   -
-   -   -
-        -
-------------
Output 2:

        ^_________
       /          \\
      /            \\
     /______________\\
     |               |
     |               |
     |_______________|
       \\            /
        \\          /
         \\________/

Raw output:
{
  "id": "chat-c341b8a0b40543918f3bb2fef68b0952",
  "object": "chat.completion",
  "created": 1727295337,
  "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure, here is the logo in ASCII art:\n\n------------- \n-        - \n-   -   - \n-   -   - \n-        - \n------------- \n\nNote that this is a very simple representation and does not capture all the details of the original logo.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "total_tokens": 73,
    "completion_tokens": 55
  },
  "prompt_logprobs": null
}
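
You can also send a local image instead of a public URL by embedding it as a base64 data URL, which the OpenAI-style image_url field accepts. A sketch, where logo.jpg is a placeholder filename:

# Encode a local image as base64 (on macOS, use `base64 -i logo.jpg`).
IMAGE_B64=$(base64 -w0 logo.jpg)

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMAGE_B64"'"}}
            ]
        }],
        "max_tokens": 512
    }' | jq .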

Serving Llama 3.2: scaling up with SkyServe#

After playing with the model, you can deploy it with autoscaling and load balancing using SkyServe.

With no change to the YAML, launch a fully managed service on your infra:

HF_TOKEN=xxx sky serve up llama3_2-vision-11b.yaml -n llama3_2 --env HF_TOKEN

Wait until the service is ready:

watch -n10 sky serve status llama3_2

Example outputs:

Services
NAME      VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
llama3_2  1        35s     READY   2/2       xx.yy.zz.100:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                       STATUS  REGION
llama3_2      1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'A100-80GB': 8})  READY   us-east4
llama3_2      2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'A100-80GB': 8})  READY   us-east4

Get a single endpoint that load-balances across replicas:

ENDPOINT=$(sky serve status --endpoint llama3_2)

Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
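
To inspect an individual replica, for example while debugging a failing readiness probe, you can stream its logs by the replica ID shown in sky serve status:

# Stream the logs of replica 1 of the llama3_2 service.
sky serve logs llama3_2 1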

To curl the endpoint:

curl http://$ENDPOINT/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer token' \
    --data '{
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "Covert this logo to ASCII art"},
                {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
            ]
        }],
        "max_tokens": 2048
    }' | jq .

To shut down all resources:

sky serve down llama3_2

See more details in SkyServe docs.

Developing and Finetuning Llama 3 series#

SkyPilot also simplifies the development and finetuning of the Llama 3 series. Check out the development and finetuning guides: Develop and Finetune.