Source: llm/gpt-2
Run GPT-2 in llm.c on any cloud with SkyPilot#
This is a reproducible package of llm.c's GPT-2 (124M) training by @karpathy (karpathy/llm.c#481). With SkyPilot, you can run GPT-2 (124M) training on any cloud. SkyPilot finds the cheapest resources available across the clouds enabled for a user, then launches and manages the entire data processing and training pipeline, bringing the run close to the ~$20 target cost that @karpathy mentioned in the discussion.
Prerequisites#
Install SkyPilot:
pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]" # Choose the clouds you want to enable
Enable clouds for SkyPilot:
sky check
See the instructions for enabling clouds in the SkyPilot docs.
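To verify only specific clouds, pass them to sky check explicitly, e.g.:
sky check aws gcp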
Download the YAML for starting the training:
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2.yaml
Run GPT-2 training#
Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 GPUs:
sky launch -c gpt2 gpt2.yaml
Or, you can train the model with a single A100 by adding --gpus A100:
sky launch -c gpt2 gpt2.yaml --gpus A100
It is also possible to speed up training by using 8 H100 GPUs (~2.3x more tokens/s than 8x A100s):
sky launch -c gpt2 gpt2.yaml --gpus H100:8
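You can compare prices for an accelerator across your enabled clouds with sky show-gpus (e.g., sky show-gpus H100:8), and stream the training output of a launched cluster at any time:
sky logs gpt2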
Download logs and visualizations#
After the training is finished, you can download the logs and visualizations with the following command:
scp -r gpt2:~/llm.c/log124M .
We can visualize the training progress with the notebook provided in llm.c. (Note: we cut off the training after 10K steps, which already achieves a validation loss similar to that of the OpenAI GPT-2 checkpoint.)
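If you prefer a quick plot without the notebook, a minimal sketch like the one below works. It assumes llm.c's logger format at the time of writing, i.e. log124M/main.log contains lines like s:<step> trl:<train loss> and s:<step> tel:<val loss>; see dev/vislog.ipynb in llm.c for the canonical parsing if the format differs.
# plot_loss.py: minimal sketch to plot llm.c loss curves from log124M/main.log.
# Assumes lines of the form "s:<step> trl:<loss>" (train) and
# "s:<step> tel:<loss>" (val); adjust the regex if llm.c's format differs.
import re
import matplotlib.pyplot as plt

train, val = [], []
with open("log124M/main.log") as f:
    for line in f:
        m = re.match(r"s:(\d+)\s+(trl|tel):([\d.]+)", line)
        if m:
            step, kind, loss = int(m[1]), m[2], float(m[3])
            (train if kind == "trl" else val).append((step, loss))

for series, label in ((train, "train loss"), (val, "val loss")):
    if series:
        plt.plot(*zip(*series), label=label)
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()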

Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot.
Advanced: Run GPT-2 training in two stages#
The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound, so running the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily separate the data processing and training into two stages and execute them sequentially, either manually or by letting SkyPilot manage the dependency between the two stages.
With this split, the data processing runs on cheaper CPU VMs (e.g., ~$0.4/hour), while the training runs on more expensive GPU VMs (e.g., ~$1.3-$3.6/hour for a single A100 GPU, or ~$10.3-$32.8/hour for 8 A100 GPUs).
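As a quick back-of-the-envelope check: tokenization takes about an hour, so running it on the CPU VM costs roughly $0.4, versus roughly $10-$33 for the same hour spent on an 8x A100 VM that mostly idles on a CPU-bound task, about a 25-80x saving for that stage.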
We can run the data processing on a CPU VM, store the processed data in a cloud bucket, and then run the training on a GPU VM using that data. Download the YAMLs for the two stages:
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml
Run two stages manually#
Data processing#
Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace your-bucket-name with your bucket name):
sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name
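The processed data persists in the bucket, so once this stage completes you can tear down the CPU VM to stop paying for it:
sky down gpt2-data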
Training#
After the data is processed, you can train the model on a GPU VM with 8 A100 GPUs (replace your-bucket-name with your bucket name):
sky launch -c gpt2-train gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
Or, you can train the model with a single A100 by adding --gpus A100:
sky launch -c gpt2-train gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name
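To avoid paying for an idle GPU VM after training finishes, you can tear it down with sky down gpt2-train, or set it to autostop after a period of idleness:
sky autostop gpt2-train -i 10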
Run in a Pipeline#
We can also combine the two stages into a single SkyPilot job and let SkyPilot handle the dependency between them (replace your-bucket-name with your bucket name):
sky jobs launch -n gpt2 gpt2-pipeline.yaml --env BUCKET_NAME=your-bucket-name
Note: the pipeline YAML can be generated by concatenating the two stage YAMLs:
cat gpt2-data.yaml > gpt2-pipeline.yaml; echo "---" >> gpt2-pipeline.yaml; cat gpt2-train.yaml >> gpt2-pipeline.yaml
SkyPilot will first download and process the dataset on a CPU VM and store the processed data in a cloud bucket. It will then launch a training job on a GPU VM to train GPT-2 (124M) on the processed data.
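Since the pipeline runs as a managed job, you can check its status and stream the logs of the current stage:
sky jobs queue
sky jobs logs -n gpt2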
Included files#
gpt2-data.yaml
name: gpt2-data

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  cpus: 8+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  # Clone llm.c and pin a known-good commit (`git clone` does not understand
  # the repo@commit syntax, so check out the commit explicitly).
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7
  # Pin the dataset revision, as the latest fineweb dataset removed
  # the samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py

run: |
  cd llm.c
  # Tokenize the FineWeb 10B-token sample (takes ~1 hour, get lunch?).
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B
  # Persist the tokenized data (and the HF cache, minus raw downloads)
  # to the mounted bucket for the training stage.
  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/
gpt2-pipeline.yaml
name: gpt2-data

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  cpus: 8+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  # Clone llm.c and pin a known-good commit (`git clone` does not understand
  # the repo@commit syntax, so check out the commit explicitly).
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7
  # Pin the dataset revision, as the latest fineweb dataset removed
  # the samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py

run: |
  cd llm.c
  # Tokenize the FineWeb 10B-token sample (takes ~1 hour, get lunch?).
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B
  # Persist the tokenized data (and the HF cache, minus raw downloads)
  # to the mounted bucket for the training stage.
  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/

---

name: gpt2-train

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  accelerators: A100:8
  # Use a docker image that ships a recent g++, needed to compile llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Docker images are not supported on Lambda yet, but its base image
    # works, so drop the image there.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

file_mounts:
  ~/.cache/huggingface:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: COPY

setup: |
  cd ~
  # Install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot (e.g., 22.04 -> 2204).
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Work around problematic apt sources (e.g., a stale kubernetes.list).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/ (header-only; llm.c's Makefile finds it there).
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI, as that requires NCCL, which needs to be
  # installed manually (below).
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Expose the preprocessed dataset from the bucket to llm.c.
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Train on multiple GPUs. Flags (per the karpathy/llm.c#481 discussion):
  # -e d12 = a depth-12 GPT-2 (124M) from scratch; -b/-t = micro-batch size
  # and sequence length; -d = total batch size in tokens (~0.5M); -r 1 =
  # recompute GELU to save memory; -z 1 = ZeRO-1 optimizer sharding;
  # -c = weight decay; -l = max learning rate; -q = final LR fraction;
  # -u = LR warmup steps; -n = checkpoint every N steps; -v/-s = val-eval
  # and sampling intervals; -h 1 = run the HellaSwag eval.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
  # Upload the logs and checkpoints to the bucket.
  rsync -Pavz log124M ~/.cache/huggingface
gpt2-train.yaml
name: gpt2-train

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  accelerators: A100:8
  # Use a docker image that ships a recent g++, needed to compile llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Docker images are not supported on Lambda yet, but its base image
    # works, so drop the image there.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

file_mounts:
  ~/.cache/huggingface:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: COPY

setup: |
  cd ~
  # Install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot (e.g., 22.04 -> 2204).
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Work around problematic apt sources (e.g., a stale kubernetes.list).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/ (header-only; llm.c's Makefile finds it there).
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI, as that requires NCCL, which needs to be
  # installed manually (below).
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Expose the preprocessed dataset from the bucket to llm.c.
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Train on multiple GPUs. Flags (per the karpathy/llm.c#481 discussion):
  # -e d12 = a depth-12 GPT-2 (124M) from scratch; -b/-t = micro-batch size
  # and sequence length; -d = total batch size in tokens (~0.5M); -r 1 =
  # recompute GELU to save memory; -z 1 = ZeRO-1 optimizer sharding;
  # -c = weight decay; -l = max learning rate; -q = final LR fraction;
  # -u = LR warmup steps; -n = checkpoint every N steps; -v/-s = val-eval
  # and sampling intervals; -h 1 = run the HellaSwag eval.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
  # Upload the logs and checkpoints to the bucket.
  rsync -Pavz log124M ~/.cache/huggingface
gpt2.yaml
name: train

resources:
  accelerators: A100:8
  # Use a docker image that ships a recent g++, needed to compile llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Docker images are not supported on Lambda yet, but its base image
    # works, so drop the image there.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

setup: |
  cd ~
  # Data-processing dependencies.
  pip install tqdm tiktoken requests datasets
  # Training dependencies:
  # install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot (e.g., 22.04 -> 2204).
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Work around problematic apt sources (e.g., a stale kubernetes.list).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/ (header-only; llm.c's Makefile finds it there).
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI, as that requires NCCL, which needs to be
  # installed manually (below).
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Pin the dataset revision, as the latest fineweb dataset removed
  # the samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Process the data:
  # tokenize the FineWeb 10B-token sample (takes ~1 hour, get lunch?).
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B
  # Start training on multiple GPUs. Flags (per the karpathy/llm.c#481
  # discussion): -e d12 = a depth-12 GPT-2 (124M) from scratch; -b/-t =
  # micro-batch size and sequence length; -d = total batch size in tokens
  # (~0.5M); -r 1 = recompute GELU to save memory; -z 1 = ZeRO-1 optimizer
  # sharding; -c = weight decay; -l = max learning rate; -q = final LR
  # fraction; -u = LR warmup steps; -n = checkpoint every N steps;
  # -v/-s = val-eval and sampling intervals; -h 1 = run the HellaSwag eval.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1