Cloud Object Storage#

SkyPilot tasks can access data from buckets in cloud object stores such as AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Cloudflare R2, and IBM COS.

Buckets are made available to each task at a local path on the remote VM, so the task can access bucket objects as if they were local files.

Usage#

Cloud object storage is specified using the file_mounts field in a SkyPilot task.

To access an existing bucket (e.g., created through cloud CLI or other tools), specify source.

# Mount an existing S3 bucket
file_mounts:
  /my_data:
    source: s3://my-bucket/  # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>
    mode: MOUNT  # Optional: either MOUNT or COPY. Defaults to MOUNT.

This will mount the contents of the bucket at s3://my-bucket/ to the remote VM at /my_data.
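In the task's run commands, the mounted path behaves like a local directory. A minimal sketch (the listed file names are illustrative):

# Read the mounted bucket like a local directory
run: |
  # Objects under s3://my-bucket/ appear as regular files under /my_data.
  ls /my_data
  head /my_data/train.csv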

To create an empty bucket, specify name.

# Create an empty GCS bucket
file_mounts:
  /my_data:
    name: my-sky-bucket
    store: gcs  # Optional: one of s3, gcs, azure, r2, ibm

SkyPilot will create an empty GCS bucket called my-sky-bucket and mount it at /my_data. This bucket can be used to write checkpoints, logs or other outputs directly to the cloud.

Note

name must be unique to create a new bucket. If the bucket already exists and was created by SkyPilot, SkyPilot will fetch and reuse the bucket, so the same YAML can be reused across runs.

To create a new bucket, upload local files to it, and attach it to the task, specify both name and source, where source is a local path.

# Create a new S3 bucket and upload local data
file_mounts:
  /my_data:
    name: my-sky-bucket
    source: ~/dataset  # Optional: path to local data to upload to the bucket
    store: s3  # Optional: one of s3, gcs, azure, r2, ibm
    mode: MOUNT  # Optional: either MOUNT or COPY. Defaults to MOUNT.

SkyPilot will create an S3 bucket called my-sky-bucket and upload the contents of ~/dataset to it. The bucket will then be mounted at /my_data, and your data will be available to the task.

If store is omitted, SkyPilot will use the same cloud provider as the task’s cloud.
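For example, if the task is pinned to GCP and store is omitted, SkyPilot would back the bucket with GCS. A sketch (the bucket name and source path are illustrative):

# store is omitted; the bucket follows the task's cloud (here, GCS on GCP)
resources:
  cloud: gcp

file_mounts:
  /my_data:
    name: my-sky-bucket
    source: ~/dataset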

Note

If the bucket already exists and was created by SkyPilot, SkyPilot will fetch and reuse the bucket. If any files at source have changed, SkyPilot will automatically sync the new files to the bucket at the start of the task.

You can find more detailed usage examples in storage_demo.yaml.

Storage modes#

A cloud storage bucket can be used in either MOUNT mode or COPY mode.

  1. MOUNT mode: The bucket is directly “mounted” to the remote VM. I.e., files are streamed when accessed by the task and all writes are replicated to the remote bucket. Any writes will also appear on other VMs mounting the same bucket. This is the default mode.

  2. COPY mode: The files are pre-fetched and cached on the local disk. Writes only affect the local copy and are not streamed back to the bucket.

[Figure: comparison of the MOUNT and COPY storage modes]
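The two modes can be combined in a single task, e.g., COPY for a read-only dataset and MOUNT for outputs. A sketch with illustrative bucket names:

file_mounts:
  # Task outputs: writes are streamed back to the bucket.
  /outputs:
    name: my-sky-outputs
    store: s3
    mode: MOUNT

  # Dataset: pre-fetched to local disk for fast reads.
  /dataset:
    source: s3://my-dataset/
    mode: COPY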

Picking a storage mode#

Choosing between MOUNT and COPY modes depends on the workload, its performance requirements, and the size of the data.

|                    | mode: MOUNT | mode: COPY |
|--------------------|-------------|------------|
| Best for           | Writing task outputs (e.g., checkpoints, logs); reading very large data that won’t fit on disk. | High-performance read-only access to datasets that fit on disk. |
| Performance        | 🟡 Slow to read/write files. Fast to provision. | ✅ Fast file access. Slow at initial provisioning. |
| Writing to buckets | 🟡 Most write operations [1] are supported. | ❌ Not supported. Read-only. |
| Disk size          | ✅ No disk size requirements [2]. | 🟡 VM disk size must be greater than the size of the bucket. |

Note

Under the hood, MOUNT mode uses FUSE to provide a close-to-open consistency model for attached buckets. This means calling close() on a file will upload the entire file to the bucket. Any subsequent reads will see the latest data.
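For illustration (the paths and file names are hypothetical), a file written under a MOUNT path is uploaded when it is closed, after which reads see the new contents:

run: |
  # Redirecting output creates and then closes result.txt, which uploads
  # the whole file to the backing bucket.
  echo "epoch 10 complete" > /my_data/result.txt
  # A later read (here, or from another VM mounting the same bucket)
  # sees the uploaded contents.
  cat /my_data/result.txt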

Note

SkyPilot does not guarantee preservation of file permissions when attaching buckets. You may need to set file permissions during task execution.
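For example, execute permissions can be restored in the task itself (the script path is illustrative):

run: |
  # Permissions may not be preserved by the bucket mount; restore them before use.
  chmod +x /my_data/run_job.sh
  /my_data/run_job.sh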

Note

Symbolic links are handled differently in file_mounts depending on whether buckets are used. For bucket mounts, symbolic links are not copied to the remote bucket. For local file_mounts that are directly rsynced to the VM, the symbolic links themselves are copied, not the data they point to. The targets must be separately mounted, or else the symlinks may break.

Common patterns#

Reading datasets#

If your dataset is already in a cloud bucket, you can directly mount it to your remote VM.

# Mount an existing S3 bucket containing a dataset
file_mounts:
  /my_data:
    source: s3://my-dataset/
    mode: MOUNT

Tip

If your dataset can fit on the VM’s disk, you can use mode: COPY to improve the I/O performance of your task. See Storage modes for more details.
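For instance, the same dataset can be mounted in COPY mode, assuming it fits on the VM's disk:

# Pre-fetch the dataset onto the VM's disk for faster reads
file_mounts:
  /my_data:
    source: s3://my-dataset/
    mode: COPY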

Storing task outputs#

You can directly write the outputs of your tasks to a cloud bucket by creating a new bucket and using it in MOUNT mode.

💡 Example use case: Writing model checkpoints, logs from training runs.

# Creates an empty bucket. Any writes to /my_data will be replicated to s3://my-sky-bucket
file_mounts:
  /my_data:
    name: my-sky-bucket
    store: s3

Your task can then write files to /my_data, and they will automatically be uploaded to the cloud.
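For instance, the run commands could point a training script's checkpoint directory at the mount (the script name and flag are illustrative):

run: |
  # Anything written under /my_data is replicated to s3://my-sky-bucket
  python train.py --checkpoint-dir /my_data/checkpoints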

Avoid re-uploading data on every run#

Compared to directly using local paths in file_mounts, uploading to a bucket can be faster because the bucket is persistent: the data is uploaded from your local machine once and reused across tasks and runs.

💡 Example use case: Uploading local dataset or files once and using it in multiple tasks.

# Creates a bucket and reuses it in multiple tasks and runs
file_mounts:
  /my_data:
    name: my-sky-bucket
    source: ~/my_local_path
    store: s3

Note

If the data at source changes, new files will be automatically synced to the bucket.

Shared file system across workers#

A bucket used in MOUNT mode can act as a shared file system across workers running on different nodes.

💡 Example use case: Inter-Process Communication (IPC), sharing data and outputs between workers.

num_nodes: 2

# This bucket will be mounted at /my_data on both nodes.
file_mounts:
  /my_data:
    name: my-sky-bucket
    store: s3
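For example, one node can write a file that the other reads through the shared mount. A sketch using the SKYPILOT_NODE_RANK environment variable that SkyPilot sets on each node (the file name and sleep are illustrative):

run: |
  if [ "${SKYPILOT_NODE_RANK}" -eq 0 ]; then
    # Node 0 writes to the shared bucket...
    echo "hello from node 0" > /my_data/message.txt
  else
    # ...and the other node reads it once the file has been uploaded.
    sleep 10
    cat /my_data/message.txt
  fi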

Using SkyPilot Storage CLI#

To manage buckets created by SkyPilot, the sky CLI provides two commands: sky storage ls and sky storage delete.

  1. sky storage ls shows buckets created by SkyPilot.

$ sky storage ls
NAME               CREATED     STORE  COMMAND                                        STATUS
sky-dataset        3 mins ago  S3     sky launch -c demo examples/storage_demo.yaml  READY
  2. sky storage delete allows you to delete any buckets created by SkyPilot.

$ sky storage delete sky-dataset
Deleting storage object sky-dataset...
I 04-02 19:42:24 storage.py:336] Detected existing storage object, loading Storage: sky-dataset
I 04-02 19:42:26 storage.py:683] Deleting S3 Bucket sky-dataset

Note

sky storage ls only shows storage objects that were created by SkyPilot. Externally created buckets and public buckets are not listed in sky storage ls and cannot be managed through SkyPilot.

Storage YAML reference#

file_mounts:
  /remote_path:
    name: str
      Identifier for the storage object. Used when creating a new storage
      or referencing an existing storage created by SkyPilot. Not required
      when using an existing bucket created externally.

    source: str
      The source attribute specifies the path that must be made available
      in the storage object. It can either be:
      - A local path
      - A list of local paths
      - A remote path using one of the following formats:
        - s3://<bucket_name>
        - gs://<bucket_name>
        - https://<azure_storage_account>.blob.core.windows.net/<container_name>
        - r2://<bucket_name>
        - cos://<region_name>/<bucket_name>

      If the source is local, data is uploaded to an appropriate bucket
      (s3, gcs, azure, r2, or ibm) in the cloud. If the source is a bucket URI,
      the data is copied or mounted directly (see the mode flag below).

    store: str; one of 's3', 'gcs', 'azure', 'r2', 'ibm'
      If you wish to force sky.Storage to be backed by a specific cloud object
      storage, you can specify it here. If not specified, SkyPilot chooses the
      appropriate object storage based on the source path and task's cloud provider.

    persistent: bool; default: True.
      Whether the remote backing stores in the cloud should be retained after
      execution of the task. Set to True to avoid uploading files again
      in subsequent runs (at the cost of storing your data in the cloud). If
      files change between runs, new files are synced to the bucket.

    mode: str; either MOUNT or COPY; default: MOUNT
      Whether to attach the bucket by copying files or by mounting the remote
      bucket. With MOUNT mode, files are streamed from the remote bucket
      and writes are replicated to the object store (and consequently, to
      other workers mounting the same Storage). With COPY mode, files are
      copied at VM initialization and any writes to the mount path will
      not be replicated to the bucket.
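Putting these fields together, a complete entry might look like the following (all values are illustrative):

file_mounts:
  /my_data:
    name: my-sky-bucket   # bucket created or reused by SkyPilot
    source: ~/dataset     # local data uploaded to the bucket
    store: s3             # back the storage with S3
    persistent: True      # keep the bucket after the task finishes
    mode: MOUNT           # stream reads and replicate writes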