Command Line Interface#
Core CLI#
sky launch#
Launch a cluster or task.
If ENTRYPOINT points to a valid YAML file, it is read in as the task specification. Otherwise, it is interpreted as a bash command.
In both cases, the commands are run under the task’s workdir (if specified) and they undergo job queue scheduling.
sky launch [OPTIONS] [ENTRYPOINT]...
Options
- -c, --cluster <cluster>#
A cluster name. If provided, either reuse an existing cluster with that name or provision a new cluster with that name. Otherwise provision a new cluster with an autogenerated name.
- --dryrun#
If True, do not actually run the job.
- -s, --detach-setup#
If True, run setup in non-interactive mode as part of the job itself. You can safely ctrl-c to detach from logging, and it will not interrupt the setup process. To see the logs again after detaching, use sky logs. To cancel setup, cancel the job via sky cancel. Useful for long-running setup commands.
- -d, --detach-run#
If True, as soon as a job is submitted, return from this call and do not stream execution logs.
- --docker#
If used, runs locally inside a docker container.
- -n, --name <name>#
Task name. Overrides the “name” config in the YAML if both are supplied.
- --workdir <workdir>#
If specified, sync this dir to the remote working directory, where the task will be invoked. Overrides the “workdir” config in the YAML if both are supplied.
- --cloud <cloud>#
The cloud to use. If specified, overrides the “resources.cloud” config. Passing “none” resets the config.
- --region <region>#
The region to use. If specified, overrides the “resources.region” config. Passing “none” resets the config.
- --zone <zone>#
The zone to use. If specified, overrides the “resources.zone” config. Passing “none” resets the config.
- --num-nodes <num_nodes>#
Number of nodes to execute the task on. Overrides the “num_nodes” config in the YAML if both are supplied.
- --cpus <cpus>#
Number of vCPUs each instance must have, e.g., --cpus=4 (exactly 4) or --cpus=4+ (at least 4). This is used to automatically select the instance type.
- --memory <memory>#
Amount of memory each instance must have, in GB, e.g., --memory=16 (exactly 16 GB) or --memory=16+ (at least 16 GB).
- --disk-size <disk_size>#
OS disk size in GBs.
- --disk-tier <disk_tier>#
OS disk tier. One of low, medium, high, best, or none. If best is specified, the best available disk tier is used. If none is specified, the default value is enforced, overriding the option in the task YAML. Default: medium
- Options:
low | medium | high | best | none
- --use-spot, --no-use-spot#
Whether to request spot instances. If specified, overrides the “resources.use_spot” config.
- --image-id <image_id>#
Custom image id for launching the instances. Passing “none” resets the config.
- --env-file <env_file>#
Path to a dotenv file with environment variables to set on the remote node.
If any values from --env-file conflict with values set by --env, the --env value will be preferred.
- --env <env>#
Environment variable to set on the remote node. It can be specified multiple times. Examples:
1. --env MY_ENV=1: set $MY_ENV on the cluster to be 1.
2. --env MY_ENV2=$HOME: set $MY_ENV2 on the cluster to be the same value as $HOME in the local environment where the CLI command is run.
3. --env MY_ENV3: set $MY_ENV3 on the cluster to be the same value as $MY_ENV3 in the local environment.
- --gpus <gpus>#
Type and number of GPUs to use. Example values: “V100:8”, “V100” (short for a count of 1), or “V100:0.5” (fractional counts are supported by the scheduling framework). If a new cluster is being launched by this command, this is the resources to provision. If an existing cluster is being reused, this is seen as the task demand, which must fit the cluster’s total resources and is used for scheduling the task. Overrides the “accelerators” config in the YAML if both are supplied. Passing “none” resets the config.
- -t, --instance-type <instance_type>#
The instance type to use. If specified, overrides the “resources.instance_type” config. Passing “none” resets the config.
- --ports <ports>#
Ports to open on the cluster. If specified, overrides the “ports” config in the YAML.
- -i, --idle-minutes-to-autostop <idle_minutes_to_autostop>#
Automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky launch -d ... and then sky autostop -i <minutes>. If not set, the cluster will not be autostopped.
- --down#
Autodown the cluster: tear down the cluster after all jobs finish (successfully or abnormally). If --idle-minutes-to-autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning, data syncing, or setup, the cluster will not be torn down, so that it can be debugged.
- -r, --retry-until-up#
Whether to retry provisioning infinitely until the cluster is up, if we fail to launch the cluster on any possible region/cloud due to unavailability errors.
- -y, --yes#
Skip confirmation prompt.
- --no-setup#
Skip setup phase when (re-)launching cluster.
- --clone-disk-from, --clone <clone_disk_from>#
[Experimental] Clone disk from an existing cluster to launch a new one. This is useful when the new cluster needs to have the same data on the boot disk as an existing cluster.
Arguments
- ENTRYPOINT#
Optional argument(s)
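The options above can be combined in a single call. The sketch below is one possible wrapper, not an official recipe: the function name launch_dev, the cluster name dev, and the chosen resource values are assumptions made for illustration. Only the flags themselves come from the option list above.

```shell
# launch_dev is a hypothetical wrapper; "dev" is a placeholder cluster name.
launch_dev() {
  # Request at least 4 vCPUs and 16 GB of memory per instance plus one V100,
  # autostop after 30 idle minutes, and skip the confirmation prompt.
  # Any extra arguments (for example a task YAML) are passed through as ENTRYPOINT.
  sky launch -c dev \
    --cpus=4+ --memory=16+ --gpus=V100:1 \
    -i 30 -y \
    "$@"
}
```

Calling launch_dev task.yaml would then reuse or provision the cluster dev and run the task on it, where task.yaml is a placeholder file name.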
sky exec#
Execute a task or command on an existing cluster.
If ENTRYPOINT points to a valid YAML file, it is read in as the task specification. Otherwise, it is interpreted as a bash command.
Actions performed by sky exec:
1. Workdir syncing, if:
   - ENTRYPOINT is a YAML with the workdir field specified; or
   - Flag --workdir=<local_dir> is set.
2. Executing the specified task’s run commands / the bash command.
sky exec is thus typically faster than sky launch, provided a cluster already exists.
All setup steps (provisioning, setup commands, file mounts syncing) are skipped. If any of those specifications changed, this command will not reflect those changes. To ensure a cluster’s setup is up to date, use sky launch instead.
Execution and scheduling behavior:
The task/command will undergo job queue scheduling, respecting any specified resource requirement. It can be executed on any node of the cluster with enough resources.
The task/command is run under the workdir (if specified).
The task/command is run non-interactively (without a pseudo-terminal or pty), so interactive commands such as htop do not work. Use ssh my_cluster instead.
Typical workflow:
# First command: set up the cluster once.
sky launch -c mycluster app.yaml
# For iterative development, simply execute the task on the launched
# cluster.
sky exec mycluster app.yaml
# Do "sky launch" again if anything other than Task.run is modified:
sky launch -c mycluster app.yaml
# Pass in commands for execution.
sky exec mycluster python train_cpu.py
sky exec mycluster --gpus=V100:1 python train_gpu.py
# Pass environment variables to the task.
sky exec mycluster --env WANDB_API_KEY python train_gpu.py
sky exec [OPTIONS] CLUSTER ENTRYPOINT...
Options
- -d, --detach-run#
If True, as soon as a job is submitted, return from this call and do not stream execution logs.
- -n, --name <name>#
Task name. Overrides the “name” config in the YAML if both are supplied.
- --workdir <workdir>#
If specified, sync this dir to the remote working directory, where the task will be invoked. Overrides the “workdir” config in the YAML if both are supplied.
- --cloud <cloud>#
The cloud to use. If specified, overrides the “resources.cloud” config. Passing “none” resets the config.
- --region <region>#
The region to use. If specified, overrides the “resources.region” config. Passing “none” resets the config.
- --zone <zone>#
The zone to use. If specified, overrides the “resources.zone” config. Passing “none” resets the config.
- --num-nodes <num_nodes>#
Number of nodes to execute the task on. Overrides the “num_nodes” config in the YAML if both are supplied.
- --cpus <cpus>#
Number of vCPUs each instance must have, e.g., --cpus=4 (exactly 4) or --cpus=4+ (at least 4). This is used to automatically select the instance type.
- --memory <memory>#
Amount of memory each instance must have, in GB, e.g., --memory=16 (exactly 16 GB) or --memory=16+ (at least 16 GB).
- --disk-size <disk_size>#
OS disk size in GBs.
- --disk-tier <disk_tier>#
OS disk tier. One of low, medium, high, best, or none. If best is specified, the best available disk tier is used. If none is specified, the default value is enforced, overriding the option in the task YAML. Default: medium
- Options:
low | medium | high | best | none
- --use-spot, --no-use-spot#
Whether to request spot instances. If specified, overrides the “resources.use_spot” config.
- --image-id <image_id>#
Custom image id for launching the instances. Passing “none” resets the config.
- --env-file <env_file>#
Path to a dotenv file with environment variables to set on the remote node.
If any values from --env-file conflict with values set by --env, the --env value will be preferred.
- --env <env>#
Environment variable to set on the remote node. It can be specified multiple times. Examples:
1. --env MY_ENV=1: set $MY_ENV on the cluster to be 1.
2. --env MY_ENV2=$HOME: set $MY_ENV2 on the cluster to be the same value as $HOME in the local environment where the CLI command is run.
3. --env MY_ENV3: set $MY_ENV3 on the cluster to be the same value as $MY_ENV3 in the local environment.
- --gpus <gpus>#
Type and number of GPUs to use. Example values: “V100:8”, “V100” (short for a count of 1), or “V100:0.5” (fractional counts are supported by the scheduling framework). If a new cluster is being launched by this command, this is the resources to provision. If an existing cluster is being reused, this is seen as the task demand, which must fit the cluster’s total resources and is used for scheduling the task. Overrides the “accelerators” config in the YAML if both are supplied. Passing “none” resets the config.
- -t, --instance-type <instance_type>#
The instance type to use. If specified, overrides the “resources.instance_type” config. Passing “none” resets the config.
- --ports <ports>#
Ports to open on the cluster. If specified, overrides the “ports” config in the YAML.
Arguments
- CLUSTER#
Required argument
- ENTRYPOINT#
Required argument(s)
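The --env-file and --env rules described under Options can be sketched as follows. The helper name run_with_env, the file name .env, and the variable names are assumptions chosen for illustration; the precedence behavior in the comments restates the option descriptions above.

```shell
# run_with_env is a hypothetical helper; ".env" and the variable names are placeholders.
run_with_env() {
  cluster="$1"
  shift
  # Values from the dotenv file are applied first; a --env value with the
  # same name is preferred over the file's value.
  # "--env WANDB_API_KEY" (no "=") forwards the local value of $WANDB_API_KEY.
  sky exec "$cluster" \
    --env-file .env \
    --env MY_ENV=1 \
    --env WANDB_API_KEY \
    "$@"
}
```

For example, run_with_env mycluster python train_gpu.py would run the command on mycluster with MY_ENV set to 1 even if .env defines a different value for it.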
sky stop#
Stop cluster(s).
CLUSTER is the name (or glob pattern) of the cluster to stop. If both CLUSTER and --all are supplied, the latter takes precedence.
Data on attached disks is not lost when a cluster is stopped. Billing for the instances will stop, while the disks will still be charged. Those disks will be reattached when restarting the cluster.
Currently, spot instance clusters cannot be stopped.
Examples:
# Stop a specific cluster.
sky stop cluster_name
# Stop multiple clusters.
sky stop cluster1 cluster2
# Stop all clusters matching glob pattern 'cluster*'.
sky stop "cluster*"
# Stop all existing clusters.
sky stop -a
sky stop [OPTIONS] [CLUSTERS]...
Options
- -a, --all#
Stop all existing clusters.
- -y, --yes#
Skip confirmation prompt.
Arguments
- CLUSTERS#
Optional argument(s)
sky start#
Restart cluster(s).
If a cluster is previously stopped (status is STOPPED) or failed in provisioning/runtime installation (status is INIT), this command will attempt to start the cluster. In the latter case, provisioning and runtime installation will be retried.
Auto-failover provisioning is not used when restarting a stopped cluster. It will be started on the same cloud, region, and zone that were chosen before.
If a cluster is already in the UP status, this command has no effect.
Examples:
# Restart a specific cluster.
sky start cluster_name
# Restart multiple clusters.
sky start cluster1 cluster2
# Restart all clusters.
sky start -a
sky start [OPTIONS] [CLUSTERS]...
Options
- -a, --all#
Start all existing clusters.
- -y, --yes#
Skip confirmation prompt.
- -i, --idle-minutes-to-autostop <idle_minutes_to_autostop>#
Automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky launch -d ... and then sky autostop -i <minutes>. If not set, the cluster will not be autostopped.
- --down#
Autodown the cluster: tear down the cluster after the specified minutes of idle time after all jobs finish (successfully or abnormally). Requires --idle-minutes-to-autostop to be set.
- -r, --retry-until-up#
Retry provisioning infinitely until the cluster is up, if we fail to start the cluster due to unavailability errors.
- -f, --force#
Force start the cluster even if it is already UP. Useful for upgrading the SkyPilot runtime on the cluster.
Arguments
- CLUSTERS#
Optional argument(s)
sky down#
Tear down cluster(s).
CLUSTER is the name of the cluster (or glob pattern) to tear down. If both CLUSTER and --all are supplied, the latter takes precedence.
Tearing down a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Accelerators (e.g., TPUs) that are part of the cluster will be deleted too.
Examples:
# Tear down a specific cluster.
sky down cluster_name
# Tear down multiple clusters.
sky down cluster1 cluster2
# Tear down all clusters matching glob pattern 'cluster*'.
sky down "cluster*"
# Tear down all existing clusters.
sky down -a
sky down [OPTIONS] [CLUSTERS]...
Options
- -a, --all#
Tear down all existing clusters.
- -y, --yes#
Skip confirmation prompt.
- -p, --purge#
Ignore cloud provider errors (if any). Useful for cleaning up manually deleted cluster(s).
Arguments
- CLUSTERS#
Optional argument(s)
sky status#
Show clusters.
If CLUSTERS is given, show those clusters. Otherwise, show all clusters.
If --ip is specified, show the IP address of the head node of the cluster. Only available when CLUSTERS contains exactly one cluster, e.g. sky status --ip mycluster.
If --endpoints is specified, show all exposed endpoints in the cluster. Only available when CLUSTERS contains exactly one cluster, e.g. sky status --endpoints mycluster. To query a single endpoint, you can use sky status mycluster --endpoint 8888.
The following fields for each cluster are recorded: cluster name, time since last launch, resources, region, zone, hourly price, status, autostop, command.
Display all fields using sky status -a.
Each cluster can have one of the following statuses:
INIT: The cluster may be live or down. It can happen in the following cases:
- Ongoing provisioning or runtime setup. (A sky launch has started but has not completed.)
- Or, the cluster is in an abnormal state, e.g., some cluster nodes are down, or the SkyPilot runtime is unhealthy. (To recover the cluster, try sky launch again on it.)
UP: Provisioning and runtime setup have succeeded and the cluster is live. (The most recent sky launch has completed successfully.)
STOPPED: The cluster is stopped and the storage is persisted. Use sky start to restart the cluster.
Autostop column:
Indicates after how many minutes of idleness (no in-progress jobs) the cluster will be autostopped. ‘-’ means disabled.
If the time is followed by ‘(down)’, e.g., ‘1m (down)’, the cluster will be autodowned, rather than autostopped.
Getting up-to-date cluster statuses:
In normal cases where clusters are entirely managed by SkyPilot (i.e., no manual operations in cloud consoles) and no autostopping is used, the table returned by this command will accurately reflect the cluster statuses.
In cases where clusters are changed outside of SkyPilot (e.g., manual operations in cloud consoles; unmanaged spot clusters getting preempted) or for autostop-enabled clusters, use --refresh to query the latest cluster statuses from the cloud providers.
sky status [OPTIONS] [CLUSTERS]...
Options
- -a, --all#
Show all information in full.
- -r, --refresh#
Query the latest cluster statuses from the cloud provider(s).
- --ip#
Get the IP address of the head node of a cluster. This option will override all other options. For Kubernetes clusters, the returned IP address is the internal IP of the head pod, and may not be accessible from outside the cluster.
- --endpoints#
Get all exposed endpoints and corresponding URLs for a cluster. This option will override all other options.
- --endpoint <endpoint>#
Get the endpoint URL for the specified port number on the cluster. This option will override all other options.
- --show-spot-jobs, --no-show-spot-jobs#
Also show recent in-progress spot jobs, if any.
- --show-services, --no-show-services#
Also show sky serve services, if any.
Arguments
- CLUSTERS#
Optional argument(s)
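The --ip and --endpoint queries lend themselves to scripting. The two helpers below are a minimal sketch with hypothetical names (cluster_ip, endpoint_url). They assume the command prints only the requested value, which the option descriptions suggest but do not spell out.

```shell
# cluster_ip is a hypothetical helper: prints the head node IP of one cluster.
cluster_ip() {
  # --ip is only available for exactly one cluster and overrides other options.
  sky status --ip "$1"
}

# endpoint_url is a hypothetical helper: prints the endpoint URL for one port.
endpoint_url() {
  sky status "$1" --endpoint "$2"
}
```

A script could then capture a value with, for instance, url=$(endpoint_url mycluster 8888), where mycluster and 8888 are the placeholder names used earlier in this section.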
sky autostop#
Schedule an autostop or autodown for cluster(s).
Autostop/autodown will automatically stop or tear down a cluster when it becomes idle for a specified duration. Idleness means there are no in-progress (pending/running) jobs in a cluster’s job queue.
CLUSTERS are the names (or glob patterns) of the clusters to stop. If both CLUSTERS and --all are supplied, the latter takes precedence.
Idleness time of a cluster is reset to zero, when any of these happens:
A job is submitted (sky launch or sky exec).
The cluster has restarted.
An autostop is set when there is no active setting. (Namely, either there’s never any autostop setting set, or the previous autostop setting was canceled.) This is useful for restarting the autostop timer.
Example: say a cluster without any autostop set has been idle for 1 hour, then an autostop of 30 minutes is set. The cluster will not be immediately autostopped. Instead, the idleness timer only starts counting after the autostop setting was set.
When multiple autostop settings are specified for the same cluster, the last setting takes precedence.
Typical usage:
# Autostop this cluster after 60 minutes of idleness.
sky autostop cluster_name -i 60
# Cancel autostop for a specific cluster.
sky autostop cluster_name --cancel
# Since autostop was canceled in the last command, idleness will
# restart counting after this command.
sky autostop cluster_name -i 60
sky autostop [OPTIONS] [CLUSTERS]...
Options
- -a, --all#
Apply this command to all existing clusters.
- -i, --idle-minutes <idle_minutes>#
Set the idle minutes before autostopping the cluster. See the doc above for detailed semantics.
- --cancel#
Cancel any currently active auto{stop,down} setting for the cluster. No-op if there is no active setting.
- --down#
Use autodown (tear down the cluster; non-restartable), instead of autostop (restartable).
- -y, --yes#
Skip confirmation prompt.
Arguments
- CLUSTERS#
Optional argument(s)
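The typical usage above covers autostop only. As a complementary sketch, the helper below schedules an autodown instead. The name autodown_after is hypothetical, and -y is added on the assumption that the call is made from a non-interactive script.

```shell
# autodown_after is a hypothetical helper.
# Usage: autodown_after <idle_minutes> <cluster or glob pattern>...
autodown_after() {
  minutes="$1"
  shift
  # --down tears the cluster down (non-restartable) instead of stopping it.
  sky autostop -i "$minutes" --down -y "$@"
}
```

For example, autodown_after 10 "dev-*" would schedule every cluster matching the placeholder pattern dev-* to be torn down after 10 idle minutes. Quote the pattern so the local shell does not expand it.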
Job Queue CLI#
sky queue#
Show the job queue for cluster(s).
sky queue [OPTIONS] [CLUSTERS]...
Options
- -a, --all-users#
Show all users’ information in full.
- -s, --skip-finished#
Show only pending/running jobs’ information.
Arguments
- CLUSTERS#
Optional argument(s)
sky logs#
Tail the log of a job.
If JOB_ID is not provided, the latest job on the cluster will be used.
1. If no flags are provided, tail the logs of the job_id specified. At most one job_id can be provided.
2. If --status is specified, print the status of the job and exit with return code 0 if the job succeeded, or 1 otherwise. At most one job_id can be specified.
3. If --sync-down is specified, the logs of the job will be downloaded from the cluster and saved to the local machine under ~/sky_logs. Multiple job_ids can be specified.
sky logs [OPTIONS] CLUSTER [JOB_IDS]...
Options
- -s, --sync-down#
Sync down the logs of a job to the local machine. For a distributed job, a separate log file from each worker will be downloaded.
- --status#
If specified, do not show logs but exit with a status code for the job’s status: 0 for succeeded, or 1 for all other statuses.
- --follow, --no-follow#
Follow the logs of a job. If --no-follow is specified, print the log so far and exit. [default: --follow]
Arguments
- CLUSTER#
Required argument
- JOB_IDS#
Optional argument(s)
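Because --status maps the job outcome to an exit code (0 for succeeded, 1 otherwise), it can drive a shell conditional. The following is a minimal sketch under that documented behavior. The function name report_job and the printed messages are assumptions for illustration.

```shell
# report_job is a hypothetical helper.
# Usage: report_job <cluster> <job_id>
report_job() {
  if sky logs --status "$1" "$2"; then
    echo "job $2 succeeded"
  else
    echo "job $2 did not succeed"
    # Download the job's logs to ~/sky_logs on the local machine for inspection.
    sky logs --sync-down "$1" "$2"
  fi
}
```

Job IDs for the second argument can be looked up with sky queue.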
sky cancel#
Cancel job(s).
Example usage:
# Cancel specific jobs on a cluster.
sky cancel cluster_name 1
sky cancel cluster_name 1 2 3
# Cancel all jobs on a cluster.
sky cancel cluster_name -a
# Cancel the latest running job on a cluster.
sky cancel cluster_name
Job IDs can be looked up by sky queue cluster_name.
sky cancel [OPTIONS] CLUSTER [JOBS]...
Options
- -a, --all#
Cancel all jobs on the specified cluster.
- -y, --yes#
Skip confirmation prompt.
Arguments
- CLUSTER#
Required argument
- JOBS#
Optional argument(s)
Sky Serve CLI#
sky serve up#
Launch a SkyServe service.
SERVICE_YAML must point to a valid YAML file.
A regular task YAML can be turned into a service YAML by adding a service field. E.g.,
# service.yaml
service:
ports: 8080
readiness_probe:
path: /health
initial_delay_seconds: 20
replicas: 1
resources:
cpus: 2+
run: python -m http.server 8080
Example:
sky serve up service.yaml
sky serve up [OPTIONS] SERVICE_YAML...
Options
- -n, --service-name <service_name>#
A service name. Unique for each service. If not provided, a unique name is autogenerated.
- --workdir <workdir>#
If specified, sync this dir to the remote working directory, where the task will be invoked. Overrides the “workdir” config in the YAML if both are supplied.
- --cloud <cloud>#
The cloud to use. If specified, overrides the “resources.cloud” config. Passing “none” resets the config.
- --region <region>#
The region to use. If specified, overrides the “resources.region” config. Passing “none” resets the config.
- --zone <zone>#
The zone to use. If specified, overrides the “resources.zone” config. Passing “none” resets the config.
- --num-nodes <num_nodes>#
Number of nodes to execute the task on. Overrides the “num_nodes” config in the YAML if both are supplied.
- --cpus <cpus>#
Number of vCPUs each instance must have, e.g., --cpus=4 (exactly 4) or --cpus=4+ (at least 4). This is used to automatically select the instance type.
- --memory <memory>#
Amount of memory each instance must have, in GB, e.g., --memory=16 (exactly 16 GB) or --memory=16+ (at least 16 GB).
- --disk-size <disk_size>#
OS disk size in GBs.
- --disk-tier <disk_tier>#
OS disk tier. One of low, medium, high, best, or none. If best is specified, the best available disk tier is used. If none is specified, the default value is enforced, overriding the option in the task YAML. Default: medium
- Options:
low | medium | high | best | none
- --use-spot, --no-use-spot#
Whether to request spot instances. If specified, overrides the “resources.use_spot” config.
- --image-id <image_id>#
Custom image id for launching the instances. Passing “none” resets the config.
- --env-file <env_file>#
Path to a dotenv file with environment variables to set on the remote node.
If any values from --env-file conflict with values set by --env, the --env value will be preferred.
- --env <env>#
Environment variable to set on the remote node. It can be specified multiple times. Examples:
1. --env MY_ENV=1: set $MY_ENV on the cluster to be 1.
2. --env MY_ENV2=$HOME: set $MY_ENV2 on the cluster to be the same value as $HOME in the local environment where the CLI command is run.
3. --env MY_ENV3: set $MY_ENV3 on the cluster to be the same value as $MY_ENV3 in the local environment.
- --gpus <gpus>#
Type and number of GPUs to use. Example values: “V100:8”, “V100” (short for a count of 1), or “V100:0.5” (fractional counts are supported by the scheduling framework). If a new cluster is being launched by this command, this is the resources to provision. If an existing cluster is being reused, this is seen as the task demand, which must fit the cluster’s total resources and is used for scheduling the task. Overrides the “accelerators” config in the YAML if both are supplied. Passing “none” resets the config.
- -t, --instance-type <instance_type>#
The instance type to use. If specified, overrides the “resources.instance_type” config. Passing “none” resets the config.
- --ports <ports>#
Ports to open on the cluster. If specified, overrides the “ports” config in the YAML.
- -y, --yes#
Skip confirmation prompt.
Arguments
- SERVICE_YAML#
Required argument(s)
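Naming the service at launch makes it easier to refer to in later commands. The sketch below is one way to do that and is not an official workflow: deploy_service is a hypothetical helper, and it assumes the endpoint can already be queried once sky serve up returns successfully.

```shell
# deploy_service is a hypothetical helper.
# Usage: deploy_service <service_name> <service_yaml>
deploy_service() {
  name="$1"
  yaml="$2"
  # Launch with an explicit name, skipping the confirmation prompt, then
  # print the service endpoint only if the launch command succeeded.
  sky serve up -n "$name" -y "$yaml" && sky serve status --endpoint "$name"
}
```

With the service.yaml example above, deploy_service my-service service.yaml would launch the service under the placeholder name my-service.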
sky serve down#
Tear down service(s).
SERVICE_NAMES is the name of the service (or glob pattern) to tear down. If both SERVICE_NAMES and --all are supplied, the latter takes precedence.
Tearing down a service will delete all of its replicas and associated resources.
Example:
# Tear down a specific service.
sky serve down my-service
# Tear down multiple services.
sky serve down my-service1 my-service2
# Tear down all services matching glob pattern 'service-*'.
sky serve down "service-*"
# Tear down all existing services.
sky serve down -a
# Forcefully tear down a service in failed status.
sky serve down failed-service --purge
sky serve down [OPTIONS] [SERVICE_NAMES]...
Options
- -a, --all#
Tear down all services.
- -p, --purge#
Tear down services in failed status.
- -y, --yes#
Skip confirmation prompt.
Arguments
- SERVICE_NAMES#
Optional argument(s)
sky serve status#
Show statuses of SkyServe services.
Show detailed statuses of one or more services. If SERVICE_NAME is not provided, show all services’ status. If --endpoint is specified, output the endpoint of the service only.
Each service can have one of the following statuses:
CONTROLLER_INIT: The controller is initializing.
REPLICA_INIT: The controller has finished initializing, and there are no ready replicas for now. This also indicates that no replica failure has been detected.
CONTROLLER_FAILED: The controller failed to start or is in an abnormal state; or the controller and load balancer processes are not alive.
READY: The service is ready to serve requests. At least one replica is in READY state (i.e., has passed the readiness probe).
SHUTTING_DOWN: The service is being shut down. This usually happens when the sky serve down command is called.
FAILED: At least one replica failed and no replica is ready. This could be caused by several reasons:
- The launching process of the replica failed.
- No readiness probe passed within initial delay seconds.
- The replica continuously failed after serving requests for a while.
- User code failed.
FAILED_CLEANUP: Some error occurred while the service was being shut down. This usually indicates resource leakages. If you see such status, please log in to the cloud console and double-check.
NO_REPLICAS: The service has no replicas. This usually happens when min_replicas is set to 0 and there is no traffic to the system.
Each replica can have one of the following statuses:
PENDING: The maximum number of simultaneous launches has been reached and the replica launch process is pending.
PROVISIONING: The replica is being provisioned.
STARTING: Replica provisioning has succeeded and the replica is initializing, e.g., installing dependencies or loading model weights.
READY: The replica is ready to serve requests (i.e., has passed the readiness probe).
NOT_READY: The replica failed a readiness probe, but has not failed the probe for a continuous period of time (otherwise it’d be shut down). This usually happens when the replica is suffering from a bad network connection or there are too many requests overwhelming the replica.
SHUTTING_DOWN: The replica is being shut down. This usually happens when the replica is being scaled down, some error occurred, or the sky serve down command is called. SkyServe will terminate all replicas that errored.
FAILED: Some error occurred when the replica was serving requests. This indicates that the replica is already shut down. (Otherwise, it is SHUTTING_DOWN.)
FAILED_CLEANUP: Some error occurred while the replica was being shut down. This usually indicates resource leakages since the termination did not finish correctly. When seeing this status, please log in to the cloud console and check whether there are some leaked VMs/resources.
PREEMPTED: The replica was preempted by the cloud provider and sky serve is recovering this replica. This only happens when the replica is a spot instance.
Examples:
# Show status for all services
sky serve status
# Show detailed status for all services
sky serve status -a
# Only show status of my-service
sky serve status my-service
sky serve status [OPTIONS] [SERVICE_NAMES]...
Options
- -a, --all#
Show all information in full.
- --endpoint#
Show service endpoint.
Arguments
- SERVICE_NAMES#
Optional argument(s)
sky serve logs#
Tail the log of a service.
Example:
# Tail the controller logs of a service
sky serve logs --controller [SERVICE_NAME]
# Print the load balancer logs so far and exit
sky serve logs --load-balancer --no-follow [SERVICE_NAME]
# Tail the logs of replica 1
sky serve logs [SERVICE_NAME] 1
sky serve logs [OPTIONS] SERVICE_NAME [REPLICA_ID]
Options
- --follow, --no-follow#
Follow the logs of the job. [default: --follow] If --no-follow is specified, print the log so far and exit.
- --controller#
Show the controller logs of this service.
- --load-balancer#
Show the load balancer logs of this service.
Arguments
- SERVICE_NAME#
Required argument
- REPLICA_ID#
Optional argument
Managed Spot Jobs CLI#
sky spot launch#
Launch a managed spot job from a YAML or a command.
If ENTRYPOINT points to a valid YAML file, it is read in as the task specification. Otherwise, it is interpreted as a bash command.
Examples:
# You can use normal task YAMLs.
sky spot launch task.yaml
sky spot launch 'echo hello!'
sky spot launch [OPTIONS] ENTRYPOINT...
Options
- -n, --name <name>#
Task name. Overrides the “name” config in the YAML if both are supplied.
- --workdir <workdir>#
If specified, sync this dir to the remote working directory, where the task will be invoked. Overrides the “workdir” config in the YAML if both are supplied.
- --cloud <cloud>#
The cloud to use. If specified, overrides the “resources.cloud” config. Passing “none” resets the config.
- --region <region>#
The region to use. If specified, overrides the “resources.region” config. Passing “none” resets the config.
- --zone <zone>#
The zone to use. If specified, overrides the “resources.zone” config. Passing “none” resets the config.
- --num-nodes <num_nodes>#
Number of nodes to execute the task on. Overrides the “num_nodes” config in the YAML if both are supplied.
- --cpus <cpus>#
Number of vCPUs each instance must have, e.g., --cpus=4 (exactly 4) or --cpus=4+ (at least 4). This is used to automatically select the instance type.
- --memory <memory>#
Amount of memory each instance must have, in GB, e.g., --memory=16 (exactly 16 GB) or --memory=16+ (at least 16 GB).
- --disk-size <disk_size>#
OS disk size in GBs.
- --disk-tier <disk_tier>#
OS disk tier. One of low, medium, high, best, or none. If best is specified, the best available disk tier is used. If none is specified, the default value is enforced, overriding the option in the task YAML. Default: medium
- Options:
low | medium | high | best | none
- --use-spot, --no-use-spot#
Whether to request spot instances. If specified, overrides the “resources.use_spot” config.
- --image-id <image_id>#
Custom image id for launching the instances. Passing “none” resets the config.
- --env-file <env_file>#
Path to a dotenv file with environment variables to set on the remote node.
If any values from --env-file conflict with values set by --env, the --env value will be preferred.
- --env <env>#
Environment variable to set on the remote node. It can be specified multiple times. Examples:
1. --env MY_ENV=1: set $MY_ENV on the cluster to be 1.
2. --env MY_ENV2=$HOME: set $MY_ENV2 on the cluster to be the same value as $HOME in the local environment where the CLI command is run.
3. --env MY_ENV3: set $MY_ENV3 on the cluster to be the same value as $MY_ENV3 in the local environment.
- --gpus <gpus>#
Type and number of GPUs to use. Example values: “V100:8”, “V100” (short for a count of 1), or “V100:0.5” (fractional counts are supported by the scheduling framework). If a new cluster is being launched by this command, this is the resources to provision. If an existing cluster is being reused, this is seen as the task demand, which must fit the cluster’s total resources and is used for scheduling the task. Overrides the “accelerators” config in the YAML if both are supplied. Passing “none” resets the config.
- -t, --instance-type <instance_type>#
The instance type to use. If specified, overrides the “resources.instance_type” config. Passing “none” resets the config.
- --ports <ports>#
Ports to open on the cluster. If specified, overrides the “ports” config in the YAML.
- --spot-recovery <spot_recovery>#
Spot recovery strategy to use for the managed spot task.
- -d, --detach-run#
If True, as soon as a job is submitted, return from this call and do not stream execution logs.
- -r, --retry-until-up, -no-r, --no-retry-until-up#
(Default: True; this flag is deprecated and will be removed in a future release.) Whether to retry provisioning infinitely until the cluster is up, if unavailability errors are encountered. This applies to launching the spot clusters (both the initial and any recovery attempts), not the spot controller.
- -y, --yes#
Skip confirmation prompt.
Arguments
- ENTRYPOINT#
Required argument(s)
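The second and third --env forms listed above differ in who resolves the value. The sketch below illustrates that difference with a hypothetical helper, show_args, which is not part of SkyPilot and only prints each argument as a program would receive it. In form 2 the local shell has already expanded $HOME before the CLI starts; in form 3 only the bare name is passed and, per the description above, the CLI looks up the local value itself.

```shell
# show_args is a hypothetical helper for illustration; it is not a SkyPilot command.
# It prints each argument on its own line, in brackets, exactly as received.
show_args() {
  printf '[%s]\n' "$@"
}

# Form 2: the local shell expands $HOME before the command runs.
show_args --env "MY_ENV2=$HOME"

# Form 3: only the variable name is passed; no local expansion happens here.
show_args --env MY_ENV3
```

Running this locally shows the expanded home directory in the first call and the plain name MY_ENV3 in the second.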
sky spot queue#
Show statuses of managed spot jobs.
Each spot job can have one of the following statuses:
- PENDING: Job is waiting for a free slot on the spot controller to be accepted.
- SUBMITTED: Job is submitted to and accepted by the spot controller.
- STARTING: Job is starting (provisioning a spot cluster).
- RUNNING: Job is running.
- RECOVERING: The spot cluster is recovering from a preemption.
- SUCCEEDED: Job succeeded.
- CANCELLING: Job was requested to be cancelled by the user, and the cancellation is in progress.
- CANCELLED: Job was cancelled by the user.
- FAILED: Job failed due to an error from the job itself.
- FAILED_SETUP: Job failed due to an error from the job's setup commands.
- FAILED_PRECHECKS: Job failed due to an error from our prechecks, such as an invalid cluster name or an infeasible resource specification.
- FAILED_NO_RESOURCE: Job failed due to resources being unavailable after a maximum number of retries.
- FAILED_CONTROLLER: Job failed due to an unexpected error in the spot controller.
If the job failed, either due to user code or spot unavailability, the error log can be found with sky spot logs --controller, e.g.:
sky spot logs --controller job_id
This also shows the logs for provisioning and any preemption and recovery attempts.
(Tip) To fetch job statuses every 60 seconds, use watch:
watch -n60 sky spot queue
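When scripting around the queue, it can help to separate statuses that are final from those that may still change. The sketch below is one reading of the status list above, not SkyPilot's own logic: it assumes SUCCEEDED, CANCELLED, and every FAILED variant are final. The function name is_final_status is hypothetical, and extracting the status string is left out because the output format of sky spot queue is not described here.

```shell
# is_final_status is a hypothetical helper, not part of SkyPilot.
# Returns 0 if the given status is assumed to be final, 1 otherwise.
is_final_status() {
  case "$1" in
    SUCCEEDED|CANCELLED|FAILED|FAILED_*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify a few of the statuses listed above.
for status in PENDING RECOVERING CANCELLING SUCCEEDED FAILED_SETUP; do
  if is_final_status "$status"; then
    echo "$status: final"
  else
    echo "$status: may still change"
  fi
done
```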
sky spot queue [OPTIONS]
Options
- -a, --all#
Show all information in full.
- -r, --refresh#
Query the latest statuses, restarting the spot controller if stopped.
- -s, --skip-finished#
Show only pending/running jobs’ information.
sky spot cancel#
Cancel managed spot jobs.
You can provide either a job name or a list of job IDs to be cancelled. These options are mutually exclusive.
Examples:
# Cancel managed spot job with name 'my-job'
$ sky spot cancel -n my-job
# Cancel managed spot jobs with IDs 1, 2, 3
$ sky spot cancel 1 2 3
sky spot cancel [OPTIONS] [JOB_IDS]...
Options
- -n, --name <name>#
Managed spot job name to cancel.
- -a, --all#
Cancel all managed spot jobs.
- -y, --yes#
Skip confirmation prompt.
Arguments
- JOB_IDS#
Optional argument(s)
sky spot logs#
Tail the log of a managed spot job.
sky spot logs [OPTIONS] [JOB_ID]
Options
- -n, --name <name>#
Managed spot job name.
- --follow, --no-follow#
Follow the logs of the job. [default: --follow] If --no-follow is specified, print the log so far and exit.
- --controller#
Show the controller logs of this job; useful for debugging launching/recoveries, etc.
Arguments
- JOB_ID#
Optional argument
Storage CLI#
sky storage ls#
List storage objects managed by SkyPilot.
sky storage ls [OPTIONS]
Options
- -a, --all#
Show all information in full.
sky storage delete#
Delete storage objects.
Examples:
# Delete two storage objects.
sky storage delete imagenet cifar10
# Delete all storage objects matching glob pattern 'imagenet*'.
sky storage delete "imagenet*"
# Delete all storage objects.
sky storage delete -a
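The quotes around imagenet* in the second example matter: without them, the local shell may expand the pattern against files in the current directory before the CLI sees it. A small sketch of that shell behavior follows, using a temporary directory and a hypothetical show_args helper; neither involves SkyPilot.

```shell
# show_args is a hypothetical helper that prints each argument it receives.
show_args() {
  printf '[%s]\n' "$@"
}

# Work in a scratch directory that contains one file matching the pattern.
scratch="$(mktemp -d)"
cd "$scratch" || exit 1
touch imagenet-local.txt

# Unquoted: the shell replaces the pattern with the matching local filename.
show_args imagenet*

# Quoted: the pattern is passed through unchanged.
show_args "imagenet*"
```

Only the quoted form hands the literal pattern to the command, which is what the glob example above relies on.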
sky storage delete [OPTIONS] [NAMES]...
Options
- -a, --all#
Delete all storage objects.
- -y, --yes#
Skip confirmation prompt.
Arguments
- NAMES#
Optional argument(s)
Utils: show-gpus/check/cost-report#
sky show-gpus#
Show supported GPU/TPU/accelerators and their prices.
The names and counts shown can be set in the accelerators field in task YAMLs, or in the --gpus flag in CLI commands. For example, if this table shows 8x V100s are supported, then the string V100:8 will be accepted by the above.
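To make the NAME:COUNT form concrete, the sketch below splits such a string into its two parts, treating a missing count as 1 as the --gpus description states. The function parse_accelerator is hypothetical and is not SkyPilot's parser; it performs no validation against the supported table.

```shell
# parse_accelerator is a hypothetical helper, not SkyPilot's parser.
# It prints the accelerator name and count from a NAME or NAME:COUNT string.
parse_accelerator() {
  spec="$1"
  name="${spec%%:*}"          # everything before the first colon
  count=1                     # a bare name is short for a count of 1
  case "$spec" in
    *:*) count="${spec#*:}" ;; # everything after the first colon
  esac
  echo "name=$name count=$count"
}

parse_accelerator V100:8
parse_accelerator V100
parse_accelerator V100:0.5
```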
To show the detailed information of a GPU/TPU type (its price, which clouds offer it, the quantity in each VM type, etc.), use sky show-gpus <gpu>.
To show all accelerators, including less common ones and their detailed information, use sky show-gpus --all.
To show all regions for a specified accelerator, use sky show-gpus <accelerator> --all-regions.
Definitions of certain fields:
- DEVICE_MEM: Memory of a single device; does not depend on the device count of the instance (VM).
- HOST_MEM: Memory of the host instance (VM).
If --region or --all-regions is not specified, the price displayed for each instance type is the lowest across all regions for both on-demand and spot instances. There may be multiple regions with the same lowest price.
sky show-gpus [OPTIONS] [ACCELERATOR_STR]
Options
- -a, --all#
Show details of all GPU/TPU/accelerator offerings.
- --cloud <cloud>#
Cloud provider to query.
- --region <region>#
The region to use. If not specified, shows accelerators from all regions.
- --all-regions#
Show pricing and instance details for a specified accelerator across all regions and clouds.
Arguments
- ACCELERATOR_STR#
Optional argument
sky check#
Check which clouds are available to use.
This checks access credentials for all clouds supported by SkyPilot. If a cloud is detected to be inaccessible, the reason and correction steps will be shown.
The enabled clouds are cached and form the “search space” to be considered for each task.
sky check [OPTIONS]
Options
- -v, --verbose#
Show the activated account for each cloud.
sky cost-report#
Show estimated costs for launched clusters.
For each cluster, this shows: cluster name, resources, launched time, duration that cluster was up, and total estimated cost.
The estimated cost column indicates the price for the cluster based on the type of resources being used and the duration of use up until now. This means if the cluster is UP, successive calls to cost-report will show increasing price.
This CLI is experimental. The estimated cost is calculated based on the local cache of the cluster status, and may not be accurate for:
- Clusters with autostop/use_spot set; or
- Clusters that were terminated/stopped on the cloud console.
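As a rough illustration of why the estimate grows while a cluster is UP, the sketch below multiplies an hourly price by the time elapsed. The price and durations are made-up numbers, and the formula is an assumption about the general shape of the estimate, not SkyPilot's actual calculation.

```shell
# Hypothetical numbers for illustration only.
hourly_price=2.00   # assumed price in dollars per hour

# estimate_cost HOURS: prints hourly_price * HOURS with two decimals.
estimate_cost() {
  awk -v price="$hourly_price" -v hours="$1" 'BEGIN { printf "%.2f\n", price * hours }'
}

estimate_cost 1.5   # estimate after 1.5 hours of uptime
estimate_cost 3     # a later call yields a larger estimate
```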
sky cost-report [OPTIONS]
Options
- -a, --all#
Show all information in full.