Advanced Configurations#
You can pass optional configurations to SkyPilot in the ~/.sky/config.yaml
file.
Such configurations apply to all new clusters and do not affect existing clusters.
Tip
Some config fields can be overridden on a per-task basis through the experimental.config_overrides
field. See here for more details.
Syntax#
Below is the configuration syntax and some example values. See detailed explanations under each field.
api_server: endpoint: http://xx.xx.xx.xx:8000 allowed_clouds: - aws - gcp - kubernetes jobs: bucket: s3://my-bucket/ controller: resources: # same spec as 'resources' in a task YAML cloud: gcp region: us-central1 cpus: 4+ # number of vCPUs, max concurrent spot jobs = 2 * cpus disk_size: 100 docker: run_options: - -v /var/run/docker.sock:/var/run/docker.sock - --shm-size=2g nvidia_gpus: disable_ecc: false admin_policy: my_package.SkyPilotPolicyV1 kubernetes: ports: loadbalancer remote_identity: my-k8s-service-account allowed_contexts: - context1 - context2 custom_metadata: labels: mylabel: myvalue annotations: myannotation: myvalue provision_timeout: 10 autoscaler: gke pod_config: metadata: labels: my-label: my-value spec: runtimeClassName: nvidia aws: labels: map-migrated: my-value Owner: user-unique-name vpc_name: skypilot-vpc use_internal_ips: true ssh_proxy_command: ssh -W %h:%p user@host security_group_name: my-security-group disk_encrypted: false prioritize_reservations: false specific_reservations: - cr-a1234567 remote_identity: LOCAL_CREDENTIALS gcp: labels: Owner: user-unique-name my-label: my-value vpc_name: skypilot-vpc use_internal_ips: true force_enable_external_ips: true ssh_proxy_command: ssh -W %h:%p user@host prioritize_reservations: false specific_reservations: - projects/my-project/reservations/my-reservation1 managed_instance_group: run_duration: 3600 provision_timeout: 900 remote_identity: LOCAL_CREDENTIALS enable_gvnic: false azure: resource_group_vm: user-resource-group-name storage_account: user-storage-account-name oci: default: oci_config_profile: SKY_PROVISION_PROFILE compartment_ocid: ocid1.compartment.oc1..aaaaaaaahr7aicqtodxmcfor6pbqn3hvsngpftozyxzqw36gj4kh3w3kkj4q image_tag_general: skypilot:cpu-oraclelinux8 image_tag_gpu: skypilot:gpu-oraclelinux8 ap-seoul-1: vcn_ocid: ocid1.vcn.oc1.ap-seoul-1.amaaaaaaak7gbriarkfs2ssus5mh347ktmi3xa72tadajep6asio3ubqgarq vcn_subnet: ocid1.subnet.oc1.ap-seoul-1.aaaaaaaa5c6wndifsij6yfyfehmi3tazn6mvhhiewqmajzcrlryurnl7nuja us-ashburn-1: vcn_ocid: ocid1.vcn.oc1.ap-seoul-1.amaaaaaaak7gbriarkfs2ssus5mh347ktmi3xa72tadajep6asio3ubqgarq vcn_subnet: ocid1.subnet.oc1.iad.aaaaaaaafbj7i3aqc4ofjaapa5edakde6g4ea2yaslcsay32cthp7qo55pxa
Fields#
api_server
#
Configure the SkyPilot API server.
api_server.endpoint
#
Endpoint of the SkyPilot API server (optional).
This is used to connect to the SkyPilot API server.
Default: null
(use the local endpoint, which will be started by SkyPilot automatically).
Example:
api_server:
endpoint: http://xx.xx.xx.xx:8000
jobs
#
Custom managed jobs controller resources (optional).
These take effects only when a managed jobs controller does not already exist.
For more information about managed jobs, see Managed Jobs.
jobs.bucket
#
Bucket to store managed jobs mount files and tmp files. Bucket must already exist.
Optional. If not set, SkyPilot will create a new bucket for each managed job launch.
Supported bucket types:
jobs:
bucket: s3://my-bucket/
# bucket: gs://my-bucket/
# bucket: https://<azure_storage_account>.blob.core.windows.net/<container>
# bucket: r2://my-bucket/
# bucket: cos://<region>/<bucket>
jobs.controller
#
Configure resources for the managed jobs controller.
For more details about tuning the jobs controller resources, see Best practices for scaling up the jobs controller.
Example:
jobs:
controller:
resources: # same spec as 'resources' in a task YAML
# optionally set specific cloud/region
cloud: gcp
region: us-central1
# default resources:
cpus: 4+
memory: 8x
disk_size: 50
allowed_clouds
#
Allow list for clouds to be used in sky check
.
This field is used to restrict the clouds that SkyPilot will check and use
when running sky check
. Any cloud already enabled but not specified here
will be disabled on the next sky check
run.
If this field is not set, SkyPilot will check and use all supported clouds.
Default: null
(use all supported clouds).
docker
#
Additional Docker run options (optional).
When image_id: docker:<docker_image>
is used in a task YAML, additional
run options for starting the Docker container can be specified here.
These options will be passed directly as command line args to docker run
,
see: https://docs.docker.com/reference/cli/docker/container/run/
The following run options are applied by default and cannot be overridden:
--net=host
--cap-add=SYS_ADMIN
--device=/dev/fuse
--security-opt=apparmor:unconfined
--runtime=nvidia # Applied if nvidia GPUs are detected on the host
docker.run_options
#
This field can be useful for mounting volumes and other advanced Docker
configurations. You can specify a list of arguments or a string, where the
former will be combined into a single string with spaces. The following is
an example option for mounting the Docker socket and increasing the size of /dev/shm
:
Example:
docker:
run_options:
- -v /var/run/docker.sock:/var/run/docker.sock
- --shm-size=2g
nvidia_gpus
#
nvidia_gpus.disable_ecc
#
Disable ECC for NVIDIA GPUs (optional).
Set to true to disable ECC for NVIDIA GPUs during provisioning. This is useful to improve the GPU performance in some cases (up to 30% improvement). This will only be applied if a cluster is requested with NVIDIA GPUs. This is best-effort – not guaranteed to work on all clouds e.g., RunPod and Kubernetes does not allow rebooting the node, though RunPod has ECC disabled by default.
Note: this setting will cause a reboot during the first provisioning of the cluster, which may take a few minutes.
Reference: portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LKjOCAW
Default: false
.
admin_policy
#
Admin policy to be applied to all tasks (optional).
The policy class to be applied to all tasks, which can be used to validate and mutate user requests.
This is useful for enforcing certain policies on all tasks, such as:
Adding custom labels.
Enforcing resource limits.
Restricting cloud providers.
Requiring spot instances.
Setting autostop timeouts.
See Admin Policy Enforcement for details.
Example:
admin_policy: my_package.SkyPilotPolicyV1
aws
#
Advanced AWS configurations (optional).
Apply to all new instances but not existing ones.
aws.labels
#
Tags to assign to all instances and buckets created by SkyPilot (optional).
Example use case: cost tracking by user/team/project.
Users should guarantee that these key-values are valid AWS tags, otherwise errors from the cloud provider will be surfaced.
Example:
aws:
labels:
# (Example) AWS Migration Acceleration Program (MAP). This tag enables the
# program's discounts.
# Ref: https://docs.aws.amazon.com/mgn/latest/ug/map-program-tagging.html
map-migrated: my-value
# (Example) Useful for keeping track of who launched what. An IAM role
# can be restricted to operate on instances owned by a certain name.
# Ref: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_ec2_tag-owner.html
#
# NOTE: SkyPilot by default assigns a "skypilot-user: <username>" tag to
# all AWS/GCP/Azure instances launched by SkyPilot.
Owner: user-unique-name
# Other examples:
my-tag: my-value
aws.vpc_name
#
VPC to use in each region (optional).
If this is set, SkyPilot will only provision in regions that contain a VPC with this name (provisioner automatically looks for such regions). Regions without a VPC with this name will not be used to launch nodes.
Default: null
(use the default VPC in each region).
aws.use_internal_ips
#
Should instances be assigned private IPs only? (optional).
Set to true to use private IPs to communicate between the local client and any SkyPilot nodes. This requires the networking stack be properly set up.
When set to true
, SkyPilot will only use private subnets to launch nodes.
Private subnets are defined as those satisfying both of these properties:
Subnets whose route tables have no routes to an internet gateway (IGW);
Subnets that are configured to not assign public IPs by default (the
map_public_ip_on_launch
attribute isfalse
).
This flag is typically set together with vpc_name
above and
ssh_proxy_command
below.
Default: false
.
aws.ssh_proxy_command
#
SSH proxy command (optional).
Useful for using a jump server to communicate with SkyPilot nodes hosted
in private VPC/subnets without public IPs. Typically set together with
vpc_name
and use_internal_ips
above.
If set, this is passed as the -o ProxyCommand
option for any SSH
connections (including rsync) used to communicate between the local client
and any SkyPilot nodes. (This option is not used between SkyPilot nodes,
since they are behind the proxy / may not have such a proxy set up.)
Default: null
.
- Format 1:
A string; the same proxy command is used for all regions.
- Format 2:
A dict mapping region names to region-specific proxy commands. NOTE: This restricts SkyPilot’s search space for this cloud to only use the specified regions and not any other regions in this cloud.
Example:
aws:
# Format 1
ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@<jump server public ip>
# Format 2
ssh_proxy_command:
us-east-1: ssh -W %h:%p -p 1234 -o StrictHostKeyChecking=no [email protected]
us-east-2: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@<jump server public ip>
aws.security_group_name
#
Security group (optional).
Security group name to use for AWS instances. If not specified,
SkyPilot will use the default name for the security group: sky-sg-<hash>
Note: please ensure the security group name specified exists in the regions the instances are going to be launched or the AWS account has the permission to create a security group.
Some example use cases are shown below. All fields are optional.
<string>
: Apply the service account with the specified name to all instances.<list of single-element dict>
: A list of single-element dictionaries mapping from the cluster name (pattern) to the security group name to use. The matching of the cluster name is done in the same order as the list.NOTE: If none of the wildcard expressions in the dictionary match the cluster name, SkyPilot will use the default security group name as mentioned above:
sky-sg-<hash>
. To specify your default, use*
as the wildcard expression.
Example:
aws:
# Format 1
security_group_name: my-security-group
# Format 2
security_group_name:
- my-cluster-name: my-security-group-1
- sky-serve-controller-*: my-security-group-2
- "*": my-default-security-group
aws.disk_encrypted
#
Encrypted boot disk (optional).
Set to true
to encrypt the boot disk of all AWS instances launched by
SkyPilot. This is useful for compliance with data protection regulations.
Default: false
.
aws.prioritize_reservations
#
Reserved capacity (optional).
Whether to prioritize capacity reservations (considered as 0 cost) in the optimizer.
If you have capacity reservations in your AWS project:
Setting this to true
guarantees the optimizer will pick any matching
reservation within all regions and AWS will auto consume your reservations
with instance match criteria to “open”, and setting to false
means
optimizer uses regular, non-zero pricing in optimization (if by chance any
matching reservation exists, AWS will still consume the reservation).
Note: this setting is default to false
for performance reasons, as it can
take half a minute to retrieve the reservations from AWS when set to true
.
Default: false
.
aws.specific_reservations
#
The targeted capacity reservations (CapacityReservationId
) to be
considered when provisioning clusters on AWS. SkyPilot will automatically
prioritize this reserved capacity (considered as zero cost) if the
requested resources matches the reservation.
Ref: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-reservations-launch.html
Example:
aws:
specific_reservations:
- cr-a1234567
- cr-b2345678
aws.remote_identity
#
Identity to use for AWS instances (optional).
Supported values:
LOCAL_CREDENTIALS: The user’s local credential files will be uploaded to AWS instances created by SkyPilot. These credentials are used for:
Accessing cloud resources (e.g., private buckets).
Launching new instances (e.g., for jobs/serve controllers).
SERVICE_ACCOUNT: Local credential files are not uploaded to AWS instances. Instead: - SkyPilot will auto-create and reuse a service account (IAM role) for AWS instances.
NO_UPLOAD: No credentials will be uploaded to instances. This is useful to avoid overriding any existing credentials that may already be automounted on the cluster.
Customized service account (IAM role): Specify this as either a
<string>
or a<list of single-element dict>
:<string>: Apply the service account with the specified name to all instances.
<list of single-element dict>: A list of single-element dictionaries mapping cluster names (patterns) to service account names.
Matching of cluster names is done in the same order as the list.
If no wildcard expression matches the cluster name,
LOCAL_CREDENTIALS
will be used.To specify a default, use
*
as the wildcard expression.
—
Caveats for SERVICE_ACCOUNT with multicloud users
This setting only affects AWS instances. Local AWS credentials will still be uploaded to non-AWS instances (since those may need access to AWS resources). To fully disable credential uploads, set
remote_identity: NO_UPLOAD
.If the SkyPilot jobs/serve controller is on AWS: - Non-AWS managed jobs or non-AWS service replicas will fail to access AWS resources. - This occurs because the controllers won’t have AWS credential files to assign to these non-AWS instances.
—
Example configuration
aws:
# Format 1
remote_identity: my-service-account-name
# Format 2
remote_identity:
- my-cluster-name: my-service-account-1
- sky-serve-controller-*: my-service-account-2
- "*": my-default-service-account
gcp
#
Advanced GCP configurations (optional).
Apply to all new instances but not existing ones.
gcp.labels
#
Labels to assign to all instances launched by SkyPilot (optional).
Example use case: cost tracking by user/team/project.
Users should guarantee that these key-values are valid GCP labels, otherwise errors from the cloud provider will be surfaced.
Example:
gcp:
labels:
Owner: user-unique-name
my-label: my-value
gcp.vpc_name
#
VPC to use (optional).
Default: null
, which implies the following behavior. First, all existing
VPCs in the project are checked against the minimal recommended firewall
rules for SkyPilot to function. If any VPC satisfies these rules, it is
used. Otherwise, a new VPC named skypilot-vpc
is automatically created
with the minimal recommended firewall rules and will be used.
If this field is set, SkyPilot will use the VPC with this name. Useful for
when users want to manually set up a VPC and precisely control its
firewall rules. If no region restrictions are given, SkyPilot only
provisions in regions for which a subnet of this VPC exists. Errors are
thrown if VPC with this name is not found. The VPC does not get modified
in any way, except when opening ports (e.g., via resources.ports
) in
which case new firewall rules permitting public traffic to those ports
will be added.
gcp.use_internal_ips
#
Should instances be assigned private IPs only? (optional).
Set to true
to use private IPs to communicate between the local client and
any SkyPilot nodes. This requires the networking stack be properly set up.
This flag is typically set together with vpc_name
above and
ssh_proxy_command
below.
Default: false
.
gcp.force_enable_external_ips
#
Should instances in a vpc where communicated with via internal IPs still have an external IP? (optional).
Set to true
to force VMs to be assigned an exteral IP even when
vpc_name
and use_internal_ips
are set.
Default: false
.
gcp.ssh_proxy_command
#
SSH proxy command (optional).
Please refer to the aws.ssh_proxy_command section above for more details.
- Format 1:
A string; the same proxy command is used for all regions.
- Format 2:
A dict mapping region names to region-specific proxy commands. NOTE: This restricts SkyPilot’s search space for this cloud to only use the specified regions and not any other regions in this cloud.
Example:
gcp:
# Format 1
ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no gcpuser@<jump server public ip>
# Format 2
ssh_proxy_command:
us-central1: ssh -W %h:%p -p 1234 -o StrictHostKeyChecking=no [email protected]
us-west1: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no gcpuser@<jump server public ip>
gcp.prioritize_reservations
#
Reserved capacity (optional).
Whether to prioritize reserved instance types/locations (considered as 0 cost) in the optimizer.
- If you have “automatically consumed” reservations in your GCP project:
Setting this to
true
guarantees the optimizer will pick any matching reservation and GCP will auto consume your reservation, and setting tofalse
means optimizer uses regular, non-zero pricing in optimization (if by chance any matching reservation exists, GCP still auto consumes the reservation).
- If you have “specifically targeted” reservations (set by the
specific_reservations
field below): This field will automatically be set to
true
.
Note: this setting is default to false
for performance reasons, as it can
take half a minute to retrieve the reservations from GCP when set to true
.
Default: false
.
gcp.specific_reservations
#
The “specifically targeted” reservations to be considered when provisioning clusters on GCP. SkyPilot will automatically prioritize this reserved capacity (considered as zero cost) if the requested resources matches the reservation.
Ref: https://cloud.google.com/compute/docs/instances/reservations-overview#consumption-type
Example:
gcp:
specific_reservations:
- projects/my-project/reservations/my-reservation1
- projects/my-project/reservations/my-reservation2
gcp.managed_instance_group
#
Managed instance group / DWS (optional).
SkyPilot supports launching instances in a managed instance group (MIG) which schedules the GPU instance creation through DWS, offering a better availability. This feature is only applied when a resource request contains GPU instances.
run_duration
: Duration for a created instance to be kept alive (in seconds, required).
This is required for the DWS to work properly. After the specified duration,
the instance will be terminated.
provision_timeout
: Timeout for provisioning an instance by DWS (in seconds, optional).
This timeout determines how long SkyPilot will wait for a managed instance
group to create the requested resources before giving up, deleting the MIG
and failing over to other locations. Larger timeouts may increase the chance
for getting a resource, but will block failover to go to other zones/regions/clouds.
Default: 900
.
Example:
gcp:
managed_instance_group:
run_duration: 3600
provision_timeout: 900
gcp.remote_identity
#
Identity to use for GCP instances (optional).
Please refer to the aws.remote_identity section above for more details.
Default: LOCAL_CREDENTIALS
.
gcp.enable_gvnic
#
Enable gVNIC network interface (optional).
Set to true to enable gVNIC network interface for all GCP instances launched by SkyPilot. This is useful for improving network performance.
Default: false
.
azure
#
Advanced Azure configurations (optional).
azure.resource_group_vm
#
Resource group for VM resources (optional).
Name of the resource group to use for VM resources. If not specified, SkyPilot will create a new resource group with a default name.
azure.storage_account
#
Storage account name (optional).
Name of the storage account to use. If not specified, SkyPilot will create a new storage account with a default name.
Example:
azure:
resource_group_vm: user-resource-group-name
storage_account: user-storage-account-name
kubernetes
#
Advanced Kubernetes configurations (optional).
kubernetes.ports
#
Port configuration mode (optional).
Can be one of:
loadbalancer
: Use LoadBalancer service to expose ports.nodeport
: Use NodePort service to expose ports.
Default: loadbalancer
.
kubernetes.remote_identity
#
Service account for remote authentication (optional).
Name of the service account to use for remote authentication.
kubernetes.allowed_contexts
#
List of allowed Kubernetes contexts (optional).
List of context names that SkyPilot is allowed to use.
kubernetes.custom_metadata
#
Custom metadata for Kubernetes resources (optional).
Custom labels and annotations to apply to all Kubernetes resources.
kubernetes.provision_timeout
#
Timeout for resource provisioning (optional).
Timeout in minutes for resource provisioning.
Default: 10
.
kubernetes.autoscaler
#
Autoscaler type (optional).
Type of autoscaler used by the underlying Kubernetes cluster. Used to configure the GPU labels used by the pods submitted by SkyPilot.
Can be one of:
gke
: Google Kubernetes Enginekarpenter
: Karpentergeneric
: Generic autoscaler, assumes nodes are labelled withskypilot.co/accelerator
.
kubernetes.pod_config
#
Pod configuration settings (optional).
Additional pod configuration settings to apply to all pods.
Example:
kubernetes:
networking: portforward
ports: loadbalancer
remote_identity: my-k8s-service-account
allowed_contexts:
- context1
- context2
custom_metadata:
labels:
mylabel: myvalue
annotations:
myannotation: myvalue
provision_timeout: 10
autoscaler: gke
pod_config:
metadata:
labels:
my-label: my-value
spec:
runtimeClassName: nvidia
imagePullSecrets:
- name: my-secret
containers:
- env:
- name: HTTP_PROXY
value: http://proxy-host:3128
volumeMounts:
- mountPath: /foo
name: example-volume
readOnly: true
volumes:
- name: example-volume
hostPath:
path: /tmp
type: Directory
- name: dshm
emptyDir:
medium: Memory
sizeLimit: 3Gi
oci
#
Advanced OCI configurations (optional).
oci_config_profile
The profile name in
~/.oci/config
to use for launching instances. Default:DEFAULT
compartment_ocid
The OCID of the compartment to use for launching instances. If not set, the root compartment will be used (optional).
image_tag_general
The default image tag to use for launching general instances (CPU) if the
image_id
parameter is not specified. Default:skypilot:cpu-ubuntu-2204
image_tag_gpu
The default image tag to use for launching GPU instances if the
image_id
parameter is not specified. Default:skypilot:gpu-ubuntu-2204
The configuration can be specified either in the default
section (applying to all regions unless overridden) or in region-specific sections.
Example:
oci:
# Region-specific configurations
ap-seoul-1:
# The OCID of the VCN to use for instances (optional).
vcn_ocid: ocid1.vcn.oc1.ap-seoul-1.amaaaaaaak7gbriarkfs2ssus5mh347ktmi3xa72tadajep6asio3ubqgarq
# The OCID of the subnet to use for instances (optional).
vcn_subnet: ocid1.subnet.oc1.ap-seoul-1.aaaaaaaa5c6wndifsij6yfyfehmi3tazn6mvhhiewqmajzcrlryurnl7nuja
us-ashburn-1:
vcn_ocid: ocid1.vcn.oc1.ap-seoul-1.amaaaaaaak7gbriarkfs2ssus5mh347ktmi3xa72tadajep6asio3ubqgarq
vcn_subnet: ocid1.subnet.oc1.iad.aaaaaaaafbj7i3aqc4ofjaapa5edakde6g4ea2yaslcsay32cthp7qo55pxa