Python API
SkyPilot offers a programmatic API in Python, which is used under the hood by the CLI.
Note
The Python API contains more experimental functions/classes than the CLI. That said, users have already built several Python libraries on top of it.
For questions or requests for support, please reach out to the development team. Your feedback is much appreciated in evolving this API!
Core API
sky.launch
- sky.launch(task, cluster_name=None, retry_until_up=False, idle_minutes_to_autostop=None, dryrun=False, down=False, stream_logs=True, backend=None, optimize_target=OptimizeTarget.COST, detach_setup=False, detach_run=False, no_setup=False, clone_disk_from=None, _is_launched_by_jobs_controller=False, _is_launched_by_sky_serve_controller=False, _disable_controller_check=False)
Launch a cluster or task.
The task’s setup and run commands are executed under the task’s workdir (when specified, it is synced to the remote cluster). The task undergoes job queue scheduling on the cluster.
Currently, the first argument must be a sky.Task, or (EXPERIMENTAL advanced usage) a sky.Dag. In the latter case, it must currently contain a single task; support for pipelines/general DAGs is in experimental branches.
- Parameters:
task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) to launch.
cluster_name (Optional[str]) – name of the cluster to create/reuse. If None, auto-generate a name.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(..., detach_run=True, ...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
down (bool) – Tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down, so that it can be debugged.
dryrun (bool) – if True, do not actually launch the cluster.
stream_logs (bool) – if True, show the logs in the terminal.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVmRayBackend).
optimize_target (OptimizeTarget) – target to optimize for. Choices: OptimizeTarget.COST, OptimizeTarget.TIME.
detach_setup (bool) – If True, run setup in non-interactive mode as part of the job itself. You can safely ctrl-c to detach from logging, and it will not interrupt the setup process. To see the logs again after detaching, use sky logs. To cancel setup, cancel the job via sky cancel. Useful for long-running setup commands.
detach_run (bool) – If True, as soon as a job is submitted, return from this function and do not stream execution logs.
no_setup (bool) – if True, do not re-run setup commands.
clone_disk_from (Optional[str]) – [Experimental] if set, clone the disk from the specified cluster. This is useful to migrate the cluster to a different availability zone or region.
Example
    import sky
    task = sky.Task(run='echo hello SkyPilot')
    task.set_resources(
        sky.Resources(cloud=sky.AWS(), accelerators='V100:4'))
    sky.launch(task, cluster_name='my-cluster')
- Raises:
exceptions.ClusterOwnerIdentityMismatchError – if the cluster is owned by another user.
exceptions.InvalidClusterNameError – if the cluster name is invalid.
exceptions.ResourcesMismatchError – if the requested resources do not match the existing cluster.
exceptions.NotSupportedError – if required features are not supported by the backend/cloud/cluster.
exceptions.ResourcesUnavailableError – if the requested resources cannot be satisfied. The failover_history of the exception will be set as:
1. Empty: iff the first-ever sky.optimize() fails to find a feasible resource; no pre-check or actual launch is attempted.
2. Non-empty: iff at least 1 exception was raised, either by our pre-checks (e.g., cluster name invalid) or by a region/zone reporting resource unavailability.
exceptions.CommandError – any ssh command error.
exceptions.NoCloudAccessError – if all clouds are disabled.
Other exceptions may be raised depending on the backend.
- Returns:
job_id: Optional[int]; the job ID of the submitted job. None if the backend is not CloudVmRayBackend, or no job is submitted to the cluster.
handle: Optional[backends.ResourceHandle]; the handle to the cluster. None if dryrun.
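As a rough sketch of how the arguments above combine, the helper below launches a small task and asks for the cluster to be stopped once it has been idle for a while. The helper name launch_with_autostop and its 10-minute default are illustrative assumptions; the sky calls and keyword names are the ones documented in this section, and the (job_id, handle) unpacking follows the Returns description.

```python
from typing import Any, Optional, Tuple


def launch_with_autostop(cluster_name: str,
                         idle_minutes: int = 10) -> Tuple[Optional[int], Any]:
    """Launch a small task and autostop the cluster once it is idle."""
    # Imported lazily so this module can be loaded without SkyPilot installed.
    import sky

    task = sky.Task(run='echo hello SkyPilot')
    task.set_resources(sky.Resources(cpus='2+'))
    # Per the Returns description, the result is a (job_id, handle) pair.
    job_id, handle = sky.launch(
        task,
        cluster_name=cluster_name,
        idle_minutes_to_autostop=idle_minutes,
    )
    return job_id, handle
```

Calling launch_with_autostop('my-cluster') provisions real cloud resources and may incur cost.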
sky.exec
- sky.exec(task, cluster_name, dryrun=False, down=False, stream_logs=True, backend=None, detach_run=False)
Execute a task on an existing cluster.
This function performs two actions:
workdir syncing, if the task has a workdir specified;
executing the task’s run commands.
All other steps (provisioning, setup commands, file mounts syncing) are skipped. If any of those specifications changed in the task, this function will not reflect those changes. To ensure a cluster’s setup is up to date, use sky.launch() instead.
Execution and scheduling behavior:
The task will undergo job queue scheduling, respecting any specified resource requirement. It can be executed on any node of the cluster with enough resources.
The task is run under the workdir (if specified).
The task is run non-interactively (without a pseudo-terminal or pty), so interactive commands such as htop do not work. Use ssh my_cluster instead.
- Parameters:
task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) containing the task to execute.
cluster_name (str) – name of an existing cluster to execute the task.
down (bool) – Tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down, so that it can be debugged.
dryrun (bool) – if True, do not actually execute the task.
stream_logs (bool) – if True, show the logs in the terminal.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVmRayBackend).
detach_run (bool) – if True, detach from logging once the task has been submitted.
- Raises:
ValueError – if the specified cluster does not exist or is not in UP status.
sky.exceptions.NotSupportedError – if the specified cluster is a controller that does not support this operation.
- Returns:
job_id: Optional[int]; the job ID of the submitted job. None if the backend is not CloudVmRayBackend, or no job is submitted to the cluster.
handle: Optional[backends.ResourceHandle]; the handle to the cluster. None if dryrun.
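A minimal sketch of submitting a follow-up command to a cluster that is already UP, skipping provisioning and setup as described above. The helper name rerun_on_cluster is an illustrative assumption; the sky.exec keywords are the ones documented in this section.

```python
from typing import Optional


def rerun_on_cluster(cluster_name: str, command: str) -> Optional[int]:
    """Submit `command` to an existing cluster without re-running setup."""
    # Imported lazily so this module can be loaded without SkyPilot installed.
    import sky

    task = sky.Task(run=command)
    # detach_run=True returns once the job is submitted instead of
    # streaming its logs.
    job_id, _handle = sky.exec(task, cluster_name=cluster_name,
                               detach_run=True)
    return job_id
```

If the task's setup commands or file mounts have changed, sky.launch() is the documented way to apply them; this helper only re-runs the run command.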
sky.stop
- sky.stop(cluster_name, purge=False)
Stop a cluster.
Data on attached disks is not lost when a cluster is stopped. Billing for the instances will stop, while the disks will still be charged. Those disks will be reattached when restarting the cluster.
Currently, spot instance clusters cannot be stopped (except for GCP, which does allow disk contents to be preserved when stopping spot VMs).
- Parameters:
cluster_name (str) – name of the cluster to stop.
purge (bool) – (Advanced) Forcefully mark the cluster as stopped in SkyPilot’s cluster table, even if the actual cluster stop operation failed on the cloud. WARNING: This flag should only be set sparingly in certain manual troubleshooting scenarios; with it set, it is the user’s responsibility to ensure there are no leaked instances and related resources.
- Raises:
ValueError – the specified cluster does not exist.
RuntimeError – failed to stop the cluster.
sky.exceptions.NotSupportedError – if the specified cluster is a spot cluster, or a TPU VM Pod cluster, or the managed jobs controller.
sky.start
- sky.start(cluster_name, idle_minutes_to_autostop=None, retry_until_up=False, down=False, force=False)
Restart a cluster.
If a cluster was previously stopped (status is STOPPED) or failed in provisioning/runtime installation (status is INIT), this function will attempt to start the cluster. In the latter case, provisioning and runtime installation will be retried.
Auto-failover provisioning is not used when restarting a stopped cluster. It will be started on the same cloud, region, and zone that were chosen before.
If a cluster is already in the UP status, this function has no effect.
- Parameters:
cluster_name (str) – name of the cluster to start.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(..., detach_run=True, ...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
down (bool) – Autodown the cluster: tear down the cluster after specified minutes of idle time after all jobs finish (successfully or abnormally). Requires idle_minutes_to_autostop to be set.
force (bool) – whether to force start the cluster even if it is already up. Useful for upgrading SkyPilot runtime.
- Raises:
ValueError – argument values are invalid: (1) the specified cluster does not exist; (2) if down is set to True but idle_minutes_to_autostop is None; (3) if the specified cluster is the managed jobs controller, and either idle_minutes_to_autostop is not None or down is True (omit them to use the default autostop settings).
sky.exceptions.NotSupportedError – if the cluster to restart was launched using a non-default backend that does not support this operation.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the cluster to restart was launched by a different user.
- Return type:
CloudVmRayResourceHandle
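The second ValueError condition above (autodown requires an idle timeout) can be checked before calling sky.start(). The function below is a hypothetical pre-check written for illustration, not part of SkyPilot.

```python
from typing import Optional


def check_start_args(idle_minutes_to_autostop: Optional[int],
                     down: bool) -> None:
    """Mirror documented rule (2): down=True needs an idle timeout."""
    if down and idle_minutes_to_autostop is None:
        raise ValueError(
            'down=True requires idle_minutes_to_autostop to be set.')


check_start_args(idle_minutes_to_autostop=10, down=True)     # accepted
check_start_args(idle_minutes_to_autostop=None, down=False)  # accepted
try:
    check_start_args(idle_minutes_to_autostop=None, down=True)
except ValueError as err:
    print(f'rejected: {err}')
```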
sky.down
- sky.down(cluster_name, purge=False)
Tear down a cluster.
Tearing down a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Accelerators (e.g., TPUs) that are part of the cluster will be deleted too.
- Parameters:
cluster_name (str) – name of the cluster to down.
purge (bool) – (Advanced) Forcefully remove the cluster from SkyPilot’s cluster table, even if the actual cluster termination failed on the cloud. WARNING: This flag should only be set sparingly in certain manual troubleshooting scenarios; with it set, it is the user’s responsibility to ensure there are no leaked instances and related resources.
- Raises:
ValueError – the specified cluster does not exist.
RuntimeError – failed to tear down the cluster.
sky.exceptions.NotSupportedError – the specified cluster is the managed jobs controller.
sky.status
- sky.status(cluster_names=None, refresh=False)
Get cluster statuses.
If cluster_names is given, return those clusters. Otherwise, return all clusters.
Each returned value has the following fields:
    {
        'name': (str) cluster name,
        'launched_at': (int) timestamp of last launch on this cluster,
        'handle': (ResourceHandle) an internal handle to the cluster,
        'last_use': (str) the last command/entrypoint that affected this cluster,
        'status': (sky.ClusterStatus) cluster status,
        'autostop': (int) idle time before autostop,
        'to_down': (bool) whether autodown is used instead of autostop,
        'metadata': (dict) metadata of the cluster,
    }
Each cluster can have one of the following statuses:
INIT: The cluster may be live or down. It can happen in the following cases:
  Ongoing provisioning or runtime setup. (A sky.launch() has started but has not completed.)
  Or, the cluster is in an abnormal state, e.g., some cluster nodes are down, or the SkyPilot runtime is unhealthy. (To recover the cluster, try sky launch again on it.)
UP: Provisioning and runtime setup have succeeded and the cluster is live. (The most recent sky.launch() has completed successfully.)
STOPPED: The cluster is stopped and the storage is persisted. Use sky.start() to restart the cluster.
Autostop column:
The autostop column gives the number of minutes of idling (no jobs running) after which the cluster will be autostopped. If to_down is True, the cluster will be autodowned, rather than autostopped.
Getting up-to-date cluster statuses:
In normal cases where clusters are entirely managed by SkyPilot (i.e., no manual operations in cloud consoles) and no autostopping is used, the table returned by this command will accurately reflect the cluster statuses.
In cases where the clusters are changed outside of SkyPilot (e.g., manual operations in cloud consoles; unmanaged spot clusters getting preempted) or for autostop-enabled clusters, use refresh=True to query the latest cluster statuses from the cloud providers.
- Returns:
A list of dicts, with each dict containing the information of a cluster. If a cluster is found to be terminated or not found, it will be omitted from the returned list.
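As an illustration of consuming the returned records, the sketch below summarizes the autostop and to_down fields. The helper describe_autostop and the sample records are made up for this example; treating a negative autostop value as "not scheduled" is an assumption borrowed from the idle_minutes convention of sky.autostop() and is not confirmed here for the status field.

```python
from typing import Any, Dict, List


def describe_autostop(record: Dict[str, Any]) -> str:
    """Summarize the 'autostop' and 'to_down' fields of one status record."""
    minutes = record['autostop']
    if minutes < 0:  # assumption: negative means no autostop is scheduled
        return f"{record['name']}: no autostop"
    action = 'autodown' if record['to_down'] else 'autostop'
    return f"{record['name']}: {action} after {minutes} idle minutes"


# Hand-written sample records containing only the fields used here.
sample_records: List[Dict[str, Any]] = [
    {'name': 'dev', 'autostop': 30, 'to_down': False},
    {'name': 'batch', 'autostop': 10, 'to_down': True},
    {'name': 'pinned', 'autostop': -1, 'to_down': False},
]
for rec in sample_records:
    print(describe_autostop(rec))
```

With a real installation, the same helper could be applied to each dict in the list returned by sky.status(refresh=True).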
sky.autostop
- sky.autostop(cluster_name, idle_minutes, down=False)
Schedule an autostop/autodown for a cluster.
Autostop/autodown will automatically stop or tear down a cluster when it becomes idle for a specified duration. Idleness means there are no in-progress (pending/running) jobs in a cluster’s job queue.
Idleness time of a cluster is reset to zero, whenever:
A job is submitted (sky.launch() or sky.exec()).
The cluster has restarted.
An autostop is set when there is no active setting. (Namely, either there’s never any autostop setting set, or the previous autostop setting was canceled.) This is useful for restarting the autostop timer.
Example: say a cluster without any autostop set has been idle for 1 hour, then an autostop of 30 minutes is set. The cluster will not be immediately autostopped. Instead, the idleness timer only starts counting after the autostop setting was set.
When multiple autostop settings are specified for the same cluster, the last setting takes precedence.
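The reset rules above can be captured in a small toy model. This is a sketch for reasoning about the documented behavior, not SkyPilot's implementation: the class IdleTimer, its method names, and the use of plain minute counters are all assumptions, and job durations are ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleTimer:
    """Toy model of the documented idleness rules (not SkyPilot code)."""
    idle_since: float = 0.0             # minute at which idleness began
    idle_minutes: Optional[int] = None  # None means no active autostop

    def job_submitted(self, now: float) -> None:
        self.idle_since = now           # a submitted job resets the timer

    def restarted(self, now: float) -> None:
        self.idle_since = now           # a cluster restart resets the timer

    def set_autostop(self, now: float, idle_minutes: int) -> None:
        if idle_minutes < 0:
            self.idle_minutes = None    # a negative value cancels autostop
            return
        if self.idle_minutes is None:
            self.idle_since = now       # no active setting: timer restarts
        self.idle_minutes = idle_minutes  # the last setting takes precedence

    def should_stop(self, now: float) -> bool:
        if self.idle_minutes is None:
            return False
        return now - self.idle_since >= self.idle_minutes


# The documented example: idle for 1 hour, then a 30-minute autostop is set.
timer = IdleTimer(idle_since=0)
timer.set_autostop(now=60, idle_minutes=30)
print(timer.should_stop(now=60))  # False: the timer only starts counting now
print(timer.should_stop(now=90))  # True: 30 idle minutes have elapsed
```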
- Parameters:
cluster_name (str) – name of the cluster.
idle_minutes (int) – the number of minutes of idleness (no pending/running jobs) after which the cluster will be stopped automatically. Setting to a negative number cancels any autostop/autodown setting.
down (bool) – if true, use autodown (tear down the cluster; non-restartable), rather than autostop (restartable).
- Raises:
ValueError – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend or the cluster is a TPU VM Pod.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
Task
- class sky.Task(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None, event_callback=None, blocked_resources=None)
Task: a computation to be run on the cloud.
- __init__(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None, event_callback=None, blocked_resources=None)
Initializes a Task.
All fields are optional. Task.run is the actual program: either a shell command to run (str) or a command generator for different nodes (lambda; see below).
Optionally, call Task.set_resources() to set the resource requirements for this task. If not set, a default CPU-only requirement is assumed (the same as sky launch).
All setters of this class, Task.set_*(), return self, i.e., they are fluent APIs and can be chained together.
Example
    # A Task that will sync up local workdir '.', containing
    # requirements.txt and train.py.
    sky.Task(setup='pip install -r requirements.txt',
             run='python train.py',
             workdir='.')

    # An empty Task for provisioning a cluster.
    task = sky.Task(num_nodes=n).set_resources(...)

    # Chaining setters.
    sky.Task().set_resources(...).set_file_mounts(...)
- Parameters:
name (Optional[str]) – A string name for the Task for display purposes.
setup (Optional[str]) – A setup command, which will be run before executing the run commands (run), and executed under workdir.
run (Union[str, Callable[[int, List[str]], Optional[str]], None]) – The actual command for the task. If not None, either a shell command (str) or a command generator (callable). If the latter, it must take a node rank and a list of node addresses as input and return a shell command (str) (valid to return None for some nodes, in which case no commands are run on them). Run commands will be run under workdir. Note the command generator should be a self-contained lambda.
envs (Optional[Dict[str, str]]) – A dictionary of environment variables to set before running the setup and run commands.
workdir (Optional[str]) – The local working directory. This directory will be synced to a location on the remote VM(s), and setup and run commands will be run under that location (thus, they can rely on relative paths when invoking binaries).
num_nodes (Optional[int]) – The number of nodes to provision for this Task. If None, treated as 1 node. If > 1, each node will execute its own setup/run command, where run can either be a str, meaning all nodes get the same command, or a lambda, with the semantics documented above.
docker_image (Optional[str]) – (EXPERIMENTAL: Only in effect when LocalDockerBackend is used.) The base docker image that this Task will be built on. Defaults to ‘gpuci/miniforge-cuda:11.4-devel-ubuntu18.04’.
blocked_resources (Optional[Iterable[Resources]]) – A set of resources that this task cannot run on.
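To make the command-generator form of run concrete, here is a small self-contained function with the documented shape: it takes a node rank and a list of node addresses and returns a shell command, or None for nodes that should run nothing. The train.py flags and the IP addresses are invented for illustration.

```python
from typing import List, Optional


def run_fn(node_rank: int, ip_list: List[str]) -> Optional[str]:
    """Return the shell command for one node, or None to run nothing there."""
    head_ip = ip_list[0]
    if node_rank == 0:
        return f'python train.py --role head --num-nodes {len(ip_list)}'
    if node_rank == 1:
        return f'python train.py --role worker --head {head_ip}'
    return None  # remaining nodes run no command


# Local preview of what each node would receive.
ips = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
for rank in range(len(ips)):
    print(rank, run_fn(rank, ips))
```

Under the parameter description above, such a function would be passed as sky.Task(run=run_fn, num_nodes=3). It refers to no outer variables, in keeping with the note that the generator should be self-contained.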
- static from_yaml(yaml_path)
Initializes a task from a task YAML.
Example
task = sky.Task.from_yaml('/path/to/task.yaml')
- Parameters:
yaml_path (str) – file path to a valid task yaml file.
- Raises:
ValueError – if the path gets loaded into a str instead of a dict; or if there are any other parsing errors.
- set_resources(resources)
Sets the required resources to execute this task.
If this function is not called for a Task, default resource requirements will be used (8 vCPUs).
- Parameters:
resources (Union[Resources, List[Resources], Set[Resources]]) – either a sky.Resources, a set of them, or a list of them. A set or a list of resources asks the optimizer to “pick the best of these resources” to run this task.
- Returns:
The current task, with resources set.
- Return type:
self
- set_resources_override(override_params)
Sets the override parameters for the resources.
- set_service(service)
Sets the service spec for this task.
- Parameters:
service (Optional[SkyServiceSpec]) – a SkyServiceSpec object.
- Returns:
The current task, with service set.
- Return type:
self
- set_file_mounts(file_mounts)
Sets the file mounts for this task.
Useful for syncing datasets, dotfiles, etc.
File mounts are a dictionary: {remote_path: local_path/cloud URI}. Local (or cloud) files/directories will be synced to the specified paths on the remote VM(s) where this Task will run.
Neither source nor destination paths can end with a slash.
Example
    task.set_file_mounts({
        '~/.dotfile': '/local/.dotfile',
        # /remote/dir/ will contain the contents of /local/dir/.
        '/remote/dir': '/local/dir',
    })
- Parameters:
file_mounts (Optional[Dict[str, str]]) – an optional dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
- Returns:
the current task, with file mounts set.
- Return type:
self
- Raises:
ValueError – if input paths are invalid.
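The trailing-slash rule stated above can be checked locally before building a task. The function below is a hypothetical pre-check written for illustration; it is not SkyPilot's own validation, which may enforce additional rules.

```python
from typing import Dict


def check_file_mounts(file_mounts: Dict[str, str]) -> None:
    """Reject mounts whose remote path or source ends with a slash."""
    for remote_path, source in file_mounts.items():
        for path in (remote_path, source):
            if path.endswith('/'):
                raise ValueError(f'Path must not end with a slash: {path!r}')


check_file_mounts({'/remote/dir': '/local/dir'})  # accepted
try:
    check_file_mounts({'/remote/dir/': '/local/dir'})
except ValueError as err:
    print(f'rejected: {err}')
```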
- update_file_mounts(file_mounts)
Updates the file mounts for this task.
Different from set_file_mounts(), this function updates into the existing file_mounts (calls dict.update()), rather than overwriting it.
This should be called before provisioning in order to take effect.
Example
    task.update_file_mounts({
        '~/.config': '~/Documents/config',
        '/tmp/workdir': '/local/workdir/cnn-cifar10',
    })
- Parameters:
file_mounts (Dict[str, str]) – a dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
- Returns:
the current task, with file mounts updated.
- Return type:
self
- Raises:
ValueError – if input paths are invalid.
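The difference between setting and updating file mounts comes down to replacing a dict versus calling dict.update() on it. The sketch below shows the two outcomes with plain dictionaries standing in for a task's file mounts; the paths are invented for illustration.

```python
existing = {'~/.dotfile': '/local/.dotfile', '/remote/dir': '/local/dir'}
new = {'/remote/dir': '/local/other', '/tmp/workdir': '/local/workdir'}

# set_file_mounts-style: the new dict replaces the old one entirely.
after_set = dict(new)

# update_file_mounts-style: dict.update() merges; new values win on clashes.
after_update = dict(existing)
after_update.update(new)

print(sorted(after_set))     # only the keys from `new`
print(sorted(after_update))  # keys from both dicts
```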
- set_storage_mounts(storage_mounts)
Sets the storage mounts for this task.
Storage mounts are a dictionary: {mount_path: sky.Storage object}, each of which mounts a sky.Storage object (a cloud object store bucket) to a path inside the remote cluster.
A sky.Storage object can be created by uploading from a local directory (setting source), or backed by an existing cloud bucket (setting name to the bucket name; or setting source to the bucket URI).
Example
    task.set_storage_mounts({
        '/remote/imagenet/': sky.Storage(name='my-bucket',
                                         source='/local/imagenet'),
    })
- Parameters:
storage_mounts (Optional[Dict[str, Storage]]) – an optional dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
- Returns:
The current task, with storage mounts set.
- Return type:
self
- Raises:
ValueError – if input paths are invalid.
- update_storage_mounts(storage_mounts)
Updates the storage mounts for this task.
Different from set_storage_mounts(), this function updates into the existing storage_mounts (calls dict.update()), rather than overwriting it.
This should be called before provisioning in order to take effect.
- Parameters:
storage_mounts (Dict[str, Storage]) – a dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
- Returns:
The current task, with storage mounts updated.
- Return type:
self
- Raises:
ValueError – if input paths are invalid.
Resources
- class sky.Resources(cloud=None, instance_type=None, cpus=None, memory=None, accelerators=None, accelerator_args=None, use_spot=None, job_recovery=None, region=None, zone=None, image_id=None, disk_size=None, disk_tier=None, ports=None, labels=None, _docker_login_config=None, _is_image_managed=None, _requires_fuse=None)
Resources: compute requirements of Tasks.
This class is immutable once created (to ensure some validations are done whenever properties change). To update the property of an instance of Resources, use resources.copy(**new_properties).
Used:
for representing resource requests for tasks/apps
as a “filter” to get concrete launchable instances
for calculating billing
for provisioning on a cloud
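The copy-to-update pattern described for this immutable class can be illustrated with a stand-in. FakeResources below is a frozen dataclass invented for this example and is not sky.Resources; it only mirrors the idea that an update produces a new object through copy(**new_properties) while the original stays unchanged.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class FakeResources:
    """Stand-in for an immutable resource spec (not sky.Resources)."""
    accelerators: Optional[str] = None
    use_spot: bool = False

    def copy(self, **new_properties: Any) -> 'FakeResources':
        # Build a new instance instead of mutating this one.
        return replace(self, **new_properties)


base = FakeResources(accelerators='V100')
spot = base.copy(use_spot=True)
print(base)  # the original is unchanged
print(spot)  # the copy carries the updated property
```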
- __init__(cloud=None, instance_type=None, cpus=None, memory=None, accelerators=None, accelerator_args=None, use_spot=None, job_recovery=None, region=None, zone=None, image_id=None, disk_size=None, disk_tier=None, ports=None, labels=None, _docker_login_config=None, _is_image_managed=None, _requires_fuse=None)
Initialize a Resources object.
All fields are optional. Resources.is_launchable decides whether the Resources is fully specified to launch an instance.
Examples
    import sky

    # Fully specified cloud and instance type (is_launchable() is True).
    sky.Resources(sky.AWS(), 'p3.2xlarge')
    sky.Resources(sky.GCP(), 'n1-standard-16')
    # accelerators is passed by keyword: the third positional
    # parameter in the signature is cpus.
    sky.Resources(sky.GCP(), 'n1-standard-8', accelerators='V100')

    # Specifying required resources; the system decides the
    # cloud/instance type. The below are equivalent:
    sky.Resources(accelerators='V100')
    sky.Resources(accelerators='V100:1')
    sky.Resources(accelerators={'V100': 1})
    sky.Resources(cpus='2+', memory='16+', accelerators='V100')
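The equivalence of 'V100', 'V100:1', and {'V100': 1} noted in the examples can be sketched as a normalization step. The function normalize_accelerators is a hypothetical helper written for illustration and is not how SkyPilot parses this argument internally; it assumes integer counts.

```python
from typing import Dict, Union


def normalize_accelerators(spec: Union[str, Dict[str, int]]) -> Dict[str, int]:
    """Turn 'NAME', 'NAME:COUNT', or {NAME: COUNT} into {NAME: COUNT}."""
    if isinstance(spec, dict):
        return dict(spec)
    name, sep, count = spec.partition(':')
    # A bare name such as 'V100' means one accelerator.
    return {name: int(count) if sep else 1}


for spec in ('V100', 'V100:1', {'V100': 1}, 'V100:2'):
    print(spec, '->', normalize_accelerators(spec))
```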
- Parameters:
cloud (Optional[Cloud]) – the cloud to use.
cpus (Union[None, int, float, str]) – the number of CPUs required for the task. If a str, must be a string of the form '2' or '2+', where the + indicates that the task requires at least 2 CPUs.
memory (Union[None, int, float, str]) – the amount of memory in GiB required. If a str, must be a string of the form '16' or '16+', where the + indicates that the task requires at least 16 GiB of memory.
accelerators (Union[None, str, Dict[str, int]]) – the accelerators required. If a str, must be a string of the form 'V100' or 'V100:2', where the :2 indicates that the task requires 2 V100 GPUs. If a dict, must be a dict of the form {'V100': 2} or {'tpu-v2-8': 1}.
accelerator_args (Optional[Dict[str, str]]) – accelerator-specific arguments. For example, {'tpu_vm': True, 'runtime_version': 'tpu-vm-base'} for TPUs.
use_spot (Optional[bool]) – whether to use spot instances. If None, defaults to False.
job_recovery (Optional[str]) – the job recovery strategy to use for the managed job to recover the cluster from preemption. Refer to the recovery_strategy module for more details.
image_id (Union[Dict[str, str], str, None]) – the image ID to use. If a str, must be a string of the image id from the cloud, such as AWS: 'ami-1234567890abcdef0', GCP: 'projects/my-project-id/global/images/my-image-name'; or an image tag provided by SkyPilot, such as AWS: 'skypilot:gpu-ubuntu-2004'. If a dict, must be a dict mapping from region to image ID, such as: { 'us-west1': 'ami-1234567890abcdef0', 'us-east1': 'ami-1234567890abcdef0' }
disk_tier (Union[str, DiskTier, None]) – the disk performance tier to use. If None, defaults to 'medium'.
ports (Union[int, str, List[str], Tuple[str], None]) – the ports to open on the instance.
labels (Optional[Dict[str, str]]) – the labels to apply to the instance. These are useful for assigning metadata that may be used by external tools. Implementation depends on the chosen cloud: on AWS, labels map to instance tags; on GCP, labels map to instance labels; on Kubernetes, labels map to pod labels. On other clouds, labels are not supported and will be ignored.
_docker_login_config (Optional[DockerLoginConfig]) – the docker configuration to use. This includes the docker username, password, and registry server. If None, skip docker login.
_requires_fuse (Optional[bool]) – whether the task requires FUSE mounting support. This is used internally by certain cloud implementations to do additional setup for FUSE mounting. This flag also safeguards against using FUSE mounting on existing clusters that do not support it. If None, defaults to False.
- Raises:
ValueError – if some attributes are invalid.
exceptions.NoCloudAccessError – if no public cloud is enabled.
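The '2' versus '2+' string forms accepted by cpus and memory can be read as an amount plus an "at least" flag. The helper parse_requirement below is a hypothetical illustration, not SkyPilot's parser; treating plain numbers as exact amounts is an assumption.

```python
from typing import Tuple, Union


def parse_requirement(value: Union[int, float, str]) -> Tuple[float, bool]:
    """Return (amount, at_least) for values like 2, '2', or '2+'."""
    if isinstance(value, str):
        at_least = value.endswith('+')
        number = value[:-1] if at_least else value
        return float(number), at_least
    # Assumption: a plain number requests exactly that amount.
    return float(value), False


for value in ('2', '2+', 16, '16+'):
    print(repr(value), '->', parse_requirement(value))
```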