Python SDK#
SkyPilot offers a Python SDK, which is used under the hood by the CLI.
Most SDK calls are asynchronous and return a future (request ID).
To wait and get the results:
sky.get(request_id): Wait for a request to finish, and get the results or exceptions.
sky.stream_and_get(request_id): Stream the logs of a request, and get the results or exceptions.
To manage asynchronous requests:
sky.api_status(): List all requests and their statuses.
sky.api_cancel(request_id): Cancel a request.
Refer to the Request Returns and Request Raises sections of each API for more details.
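The submit-then-wait pattern described above can be sketched as follows. This is a minimal illustration rather than SkyPilot code: `FakeSDK` and its `submit` method are hypothetical stand-ins that mimic only the documented `get` / `stream_and_get` call shapes, so that the snippet runs without a SkyPilot installation. In real use, the imported `sky` module would be passed in place of the stand-in.

```python
from typing import Any, Dict


def wait_for_result(sdk: Any, request_id: str, stream_logs: bool = False) -> Any:
    """Block on a request ID and return the result of the request."""
    if stream_logs:
        # Stream the request's logs while waiting.
        return sdk.stream_and_get(request_id)
    return sdk.get(request_id)


class FakeSDK:
    """Hypothetical stand-in for the `sky` module; for illustration only."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._next_id = 0

    def submit(self, result: Any) -> str:
        # Mimics an SDK call that returns immediately with a request ID.
        self._next_id += 1
        request_id = f'req-{self._next_id}'
        self._results[request_id] = result
        return request_id

    def get(self, request_id: str) -> Any:
        return self._results[request_id]

    def stream_and_get(self, request_id: str) -> Any:
        print(f'(streaming logs for {request_id})')
        return self._results[request_id]


fake = FakeSDK()
rid = fake.submit({'status': 'ok'})
result = wait_for_result(fake, rid)
streamed = wait_for_result(fake, rid, stream_logs=True)
```

Keeping the wait step in one helper makes it easy to switch between silent waiting and log streaming per call site.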
Note
Upgrading from v0.8 or older: If you upgraded from a version equal to or older than 0.8.0 to any newer version, you need to update your program to adapt to the new asynchronous execution model. See the migration guide for more details.
Clusters SDK#
sky.launch#
- sky.launch(task, cluster_name=None, retry_until_up=False, idle_minutes_to_autostop=None, dryrun=False, down=False, backend=None, optimize_target=OptimizeTarget.COST, no_setup=False, clone_disk_from=None, fast=False, _need_confirmation=False, _is_launched_by_jobs_controller=False, _is_launched_by_sky_serve_controller=False, _disable_controller_check=False)[source]
Launches a cluster or task.
The task’s setup and run commands are executed under the task’s workdir (when specified, it is synced to remote cluster). The task undergoes job queue scheduling on the cluster.
Currently, the first argument must be a sky.Task, or (EXPERIMENTAL advanced usage) a sky.Dag. In the latter case, it must currently contain a single task; support for pipelines/general DAGs is in experimental branches.
Example
    import sky

    task = sky.Task(run='echo hello SkyPilot')
    task.set_resources(
        sky.Resources(cloud=sky.AWS(), accelerators='V100:4'))
    sky.launch(task, cluster_name='my-cluster')
- Parameters:
task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) to launch.
cluster_name (Optional[str]) – name of the cluster to create/reuse. If None, auto-generate a name.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch() and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
dryrun (bool) – if True, do not actually launch the cluster.
down (bool) – tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down, so that it can be debugged.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVmRayBackend).
optimize_target (OptimizeTarget) – target to optimize for. Choices: OptimizeTarget.COST, OptimizeTarget.TIME.
no_setup (bool) – if True, do not re-run setup commands.
clone_disk_from (Optional[str]) – [Experimental] if set, clone the disk from the specified cluster. This is useful to migrate the cluster to a different availability zone or region.
fast (bool) – [Experimental] if the cluster is already up and available, skip provisioning and setup steps.
_need_confirmation (bool) – (Internal only) if True, show the confirmation prompt.
- Returns:
The request ID of the launch request.
- Request Returns:
job_id (Optional[int]) – the job ID of the submitted job. None if the backend is not CloudVmRayBackend, or no job is submitted to the cluster.
handle (Optional[backends.ResourceHandle]) – the handle to the cluster. None if dryrun.
- Request Raises:
exceptions.ClusterOwnerIdentityMismatchError – if the cluster is owned by another user.
exceptions.InvalidClusterNameError – if the cluster name is invalid.
exceptions.ResourcesMismatchError – if the requested resources do not match the existing cluster.
exceptions.NotSupportedError – if required features are not supported by the backend/cloud/cluster.
exceptions.ResourcesUnavailableError – if the requested resources cannot be satisfied. The failover_history of the exception will be set as:
Empty: iff the first-ever sky.optimize() fails to find a feasible resource; no pre-check or actual launch is attempted.
Non-empty: iff at least 1 exception from either our pre-checks (e.g., cluster name invalid) or a region/zone throwing resource unavailability.
exceptions.CommandError – any ssh command error.
exceptions.NoCloudAccessError – if all clouds are disabled.
Other exceptions may be raised depending on the backend.
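The two `failover_history` cases described for `ResourcesUnavailableError` can be told apart with a small helper. The sketch below assumes that `failover_history` behaves like a list of exception objects; that shape is inferred from the description above, and the helper name `describe_failover` is hypothetical.

```python
from typing import List


def describe_failover(failover_history: List[Exception]) -> str:
    """Summarize what a failover history says about a failed launch."""
    if not failover_history:
        # Empty: the optimizer found no feasible resource, nothing was tried.
        return 'optimizer found no feasible resources; no launch was attempted'
    last = failover_history[-1]
    # Non-empty: at least one pre-check or region/zone attempt failed.
    return (f'{len(failover_history)} failed attempt(s); '
            f'last error: {type(last).__name__}: {last}')


summary_empty = describe_failover([])
summary_tried = describe_failover([RuntimeError('zone has no capacity')])
```

In real code the list would come from the caught exception's `failover_history` attribute.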
sky.stop#
- sky.stop(cluster_name, purge=False)[source]
Stops a cluster.
Data on attached disks is not lost when a cluster is stopped. Billing for the instances will stop, while the disks will still be charged. Those disks will be reattached when restarting the cluster.
Currently, spot instance clusters cannot be stopped (except for GCP, which does allow disk contents to be preserved when stopping spot VMs).
- Parameters:
cluster_name (str) – name of the cluster to stop.
purge (bool) – (Advanced) Forcefully mark the cluster as stopped in SkyPilot’s cluster table, even if the actual cluster stop operation failed on the cloud. WARNING: This flag should only be set sparingly in certain manual troubleshooting scenarios; with it set, it is the user’s responsibility to ensure there are no leaked instances and related resources.
- Returns:
The request ID of the stop request.
- Request Returns:
None
- Request Raises:
sky.exceptions.ClusterDoesNotExist – the specified cluster does not exist.
RuntimeError – failed to stop the cluster.
sky.exceptions.NotSupportedError – if the specified cluster is a spot cluster, or a TPU VM Pod cluster, or the managed jobs controller.
sky.start#
- sky.start(cluster_name, idle_minutes_to_autostop=None, retry_until_up=False, down=False, force=False)[source]
Restarts a cluster.
If a cluster is previously stopped (status is STOPPED) or failed in provisioning/runtime installation (status is INIT), this function will attempt to start the cluster. In the latter case, provisioning and runtime installation will be retried.
Auto-failover provisioning is not used when restarting a stopped cluster. It will be started on the same cloud, region, and zone that were chosen before.
If a cluster is already in the UP status, this function has no effect.
- Parameters:
cluster_name (str) – name of the cluster to start.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.start() and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
down (bool) – Autodown the cluster: tear down the cluster after the specified minutes of idle time after all jobs finish (successfully or abnormally). Requires idle_minutes_to_autostop to be set.
force (bool) – whether to force start the cluster even if it is already up. Useful for upgrading SkyPilot runtime.
- Returns:
The request ID of the start request.
- Request Returns:
None
- Request Raises:
ValueError – argument values are invalid: (1) if down is set to True but idle_minutes_to_autostop is None; (2) if the specified cluster is the managed jobs controller, and either idle_minutes_to_autostop is not None or down is True (omit them to use the default autostop settings).
sky.exceptions.ClusterDoesNotExist – the specified cluster does not exist.
sky.exceptions.NotSupportedError – if the cluster to restart was launched using a non-default backend that does not support this operation.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the cluster to restart was launched by a different user.
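The two `ValueError` conditions listed above can be checked on the client side before submitting a request. The following is a sketch of those documented rules only, not the SDK's own validation; `check_start_args`, `is_valid`, and the `is_jobs_controller` flag are hypothetical names introduced for illustration.

```python
from typing import Optional


def check_start_args(idle_minutes_to_autostop: Optional[int],
                     down: bool,
                     is_jobs_controller: bool = False) -> None:
    """Raise ValueError for the argument combinations documented as invalid."""
    # Rule (1): autodown needs an idle time to count against.
    if down and idle_minutes_to_autostop is None:
        raise ValueError(
            'down=True requires idle_minutes_to_autostop to be set.')
    # Rule (2): the jobs controller keeps its default autostop settings.
    if is_jobs_controller and (idle_minutes_to_autostop is not None or down):
        raise ValueError(
            'Omit idle_minutes_to_autostop and down for the jobs controller.')


def is_valid(idle_minutes_to_autostop: Optional[int],
             down: bool,
             is_jobs_controller: bool = False) -> bool:
    """Return True if the combination passes check_start_args."""
    try:
        check_start_args(idle_minutes_to_autostop, down, is_jobs_controller)
        return True
    except ValueError:
        return False
```

For example, `is_valid(None, True)` is rejected by rule (1), while `is_valid(10, True)` passes.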
sky.down#
- sky.down(cluster_name, purge=False)[source]
Tears down a cluster.
Tearing down a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Accelerators (e.g., TPUs) that are part of the cluster will be deleted too.
- Parameters:
cluster_name (str) – name of the cluster to tear down.
purge (bool) – (Advanced) Forcefully remove the cluster from SkyPilot’s cluster table, even if the actual cluster termination failed on the cloud. WARNING: This flag should only be set sparingly in certain manual troubleshooting scenarios; with it set, it is the user’s responsibility to ensure there are no leaked instances and related resources.
- Returns:
The request ID of the down request.
- Request Returns:
None
- Request Raises:
sky.exceptions.ClusterDoesNotExist – the specified cluster does not exist.
RuntimeError – failed to tear down the cluster.
sky.exceptions.NotSupportedError – the specified cluster is the managed jobs controller.
sky.status#
- sky.status(cluster_names=None, refresh=StatusRefreshMode.NONE, all_users=False)[source]
Gets cluster statuses.
If cluster_names is given, return those clusters. Otherwise, return all clusters.
Each cluster can have one of the following statuses:
- INIT: The cluster may be live or down. It can happen in the following cases:
  - Ongoing provisioning or runtime setup. (A sky.launch() has started but has not completed.)
  - Or, the cluster is in an abnormal state, e.g., some cluster nodes are down, or the SkyPilot runtime is unhealthy. (To recover the cluster, try sky launch again on it.)
- UP: Provisioning and runtime setup have succeeded and the cluster is live. (The most recent sky.launch() has completed successfully.)
- STOPPED: The cluster is stopped and the storage is persisted. Use sky.start() to restart the cluster.
Autostop column:
The autostop column shows the number of idle minutes (no jobs running) after which the cluster will be autostopped. If to_down is True, the cluster will be autodowned, rather than autostopped.
Getting up-to-date cluster statuses:
In normal cases where clusters are entirely managed by SkyPilot (i.e., no manual operations in cloud consoles) and no autostopping is used, the table returned by this command will accurately reflect the cluster statuses.
In cases where the clusters are changed outside of SkyPilot (e.g., manual operations in cloud consoles; unmanaged spot clusters getting preempted) or for autostop-enabled clusters, use refresh=True to query the latest cluster statuses from the cloud providers.
- Parameters:
cluster_names (Optional[List[str]]) – a list of cluster names to query. If not provided, all clusters will be queried.
refresh (StatusRefreshMode) – whether to query the latest cluster statuses from the cloud provider(s).
all_users (bool) – whether to include all users’ clusters. By default, only the current user’s clusters are included.
- Returns:
The request ID of the status request.
- Request Returns:
cluster_records (List[Dict[str, Any]]) – A list of dicts, with each dict containing the information of a cluster. If a cluster is found to be terminated or not found, it will be omitted from the returned list.
    {
        'name': (str) cluster name,
        'launched_at': (int) timestamp of last launch on this cluster,
        'handle': (ResourceHandle) an internal handle to the cluster,
        'last_use': (str) the last command/entrypoint that affected this cluster,
        'status': (sky.ClusterStatus) cluster status,
        'autostop': (int) idle time before autostop,
        'to_down': (bool) whether autodown is used instead of autostop,
        'metadata': (dict) metadata of the cluster,
        'user_hash': (str) user hash of the cluster owner,
        'user_name': (str) user name of the cluster owner,
        'resources_str': (str) the resource string representation of the cluster,
    }
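As an illustration of consuming these records, the sketch below turns a list of cluster records into one-line summaries. It relies on two assumptions that are not stated above: that a negative `autostop` value means no autostop is scheduled, and that the status can be formatted directly (the sample data uses plain strings in place of `sky.ClusterStatus` members). The helper name `summarize_clusters` is hypothetical.

```python
from typing import Any, Dict, List


def summarize_clusters(cluster_records: List[Dict[str, Any]]) -> List[str]:
    """Build a one-line summary per cluster record."""
    lines = []
    for record in cluster_records:
        autostop = record['autostop']
        if autostop < 0:
            # Assumption: a negative value means no autostop is scheduled.
            policy = 'no autostop'
        else:
            action = 'autodown' if record['to_down'] else 'autostop'
            policy = f'{action} after {autostop} idle min'
        lines.append(f"{record['name']}: {record['status']}, {policy}")
    return lines


# Sample records with only the fields used here; statuses are plain strings.
sample_records = [
    {'name': 'train', 'status': 'UP', 'autostop': 30, 'to_down': True},
    {'name': 'dev', 'status': 'STOPPED', 'autostop': -1, 'to_down': False},
]
summaries = summarize_clusters(sample_records)
```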
sky.autostop#
- sky.autostop(cluster_name, idle_minutes, down=False)[source]
Schedules an autostop/autodown for a cluster.
Autostop/autodown will automatically stop or teardown a cluster when it becomes idle for a specified duration. Idleness means there are no in-progress (pending/running) jobs in a cluster’s job queue.
Idleness time of a cluster is reset to zero, whenever:
A job is submitted (sky.launch() or sky.exec()).
The cluster has restarted.
An autostop is set when there is no active setting. (Namely, either there’s never any autostop setting set, or the previous autostop setting was canceled.) This is useful for restarting the autostop timer.
Example: say a cluster without any autostop set has been idle for 1 hour, then an autostop of 30 minutes is set. The cluster will not be immediately autostopped. Instead, the idleness timer only starts counting after the autostop setting was set.
When multiple autostop settings are specified for the same cluster, the last setting takes precedence.
- Parameters:
cluster_name (str) – name of the cluster.
idle_minutes (int) – the number of minutes of idleness (no pending/running jobs) after which the cluster will be stopped automatically. Setting to a negative number cancels any autostop/autodown setting.
down (bool) – if True, use autodown (tear down the cluster; non-restartable), rather than autostop (restartable).
- Returns:
The request ID of the autostop request.
- Request Returns:
None
- Request Raises:
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend or the cluster is TPU VM Pod.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
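The idleness-timer rules described for `sky.autostop` can be made concrete with a toy model. This is a simplified sketch, not SkyPilot's implementation: `IdleTimer` is a hypothetical class, time is a plain number of minutes, and jobs are treated as instantaneous events that only reset the timer.

```python
from typing import Optional


class IdleTimer:
    """Toy model of the documented idleness rules; times are in minutes."""

    def __init__(self) -> None:
        self.idle_since = 0.0  # When the current idle period started.
        self.idle_minutes: Optional[int] = None  # None: no active autostop.

    def job_submitted_or_restarted(self, now: float) -> None:
        # A job submission or a cluster restart resets idleness to zero.
        self.idle_since = now

    def set_autostop(self, now: float, idle_minutes: int) -> None:
        if idle_minutes < 0:
            self.idle_minutes = None  # A negative value cancels the setting.
            return
        if self.idle_minutes is None:
            # No active setting: the timer starts counting from now.
            self.idle_since = now
        self.idle_minutes = idle_minutes  # The last setting takes precedence.

    def should_stop(self, now: float) -> bool:
        if self.idle_minutes is None:
            return False
        return now - self.idle_since >= self.idle_minutes


# The documented example: idle for 1 hour, then a 30-minute autostop is set.
timer = IdleTimer()
timer.set_autostop(now=60, idle_minutes=30)
stop_immediately = timer.should_stop(now=60)
stop_after_30_more = timer.should_stop(now=90)
```

In this model the cluster is not stopped at minute 60 even though it has already been idle for an hour; it becomes eligible only 30 minutes after the setting was applied.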
Jobs SDK#
Cluster jobs SDK#
sky.exec#
- sky.exec(task, cluster_name=None, dryrun=False, down=False, backend=None)[source]
Executes a task on an existing cluster.
This function performs two actions:
workdir syncing, if the task has a workdir specified;
executing the task’s run commands.
All other steps (provisioning, setup commands, file mounts syncing) are skipped. If any of those specifications changed in the task, this function will not reflect those changes. To ensure a cluster’s setup is up to date, use sky.launch() instead.
Execution and scheduling behavior:
The task will undergo job queue scheduling, respecting any specified resource requirement. It can be executed on any node of the cluster with enough resources.
The task is run under the workdir (if specified).
The task is run non-interactively (without a pseudo-terminal or pty), so interactive commands such as htop do not work. Use ssh my_cluster instead.
- Parameters:
task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) containing the task to execute.
cluster_name (Optional[str]) – name of an existing cluster to execute the task.
dryrun (bool) – if True, do not actually execute the task.
down (bool) – tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down, so that it can be debugged.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVmRayBackend).
- Returns:
The request ID of the exec request.
- Request Returns:
job_id (Optional[int]) – the job ID of the submitted job. None if the backend is not CloudVmRayBackend, or no job is submitted to the cluster.
handle (Optional[backends.ResourceHandle]) – the handle to the cluster. None if dryrun.
- Request Raises:
ValueError – if the specified cluster is not in UP status.
sky.exceptions.ClusterDoesNotExist – if the specified cluster does not exist.
sky.exceptions.NotSupportedError – if the specified cluster is a controller that does not support this operation.
sky.queue#
- sky.queue(cluster_name, skip_finished=False, all_users=False)[source]
Gets the job queue of a cluster.
- Returns:
The request ID of the queue request.
- Request Returns:
job_records (List[Dict[str, Any]]) – A list of dicts for each job in the queue.
    [
        {
            'job_id': (int) job id,
            'job_name': (str) job name,
            'username': (str) username,
            'user_hash': (str) user hash,
            'submitted_at': (int) timestamp of submitted,
            'start_at': (int) timestamp of started,
            'end_at': (int) timestamp of ended,
            'resources': (str) resources,
            'status': (job_lib.JobStatus) job status,
            'log_path': (str) log path,
        }
    ]
- Request Raises:
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
sky.exceptions.CommandError – if failed to get the job queue with ssh.
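A typical use of the job records returned by a queue request is computing how long each job ran. The sketch below assumes that `start_at` and `end_at` are Unix timestamps in seconds and that they are `None` for jobs that have not started or finished; neither detail is confirmed above. The helper name and sample data are hypothetical, with plain strings standing in for `job_lib.JobStatus` values.

```python
from typing import Any, Dict, List, Optional


def job_runtime_seconds(job: Dict[str, Any]) -> Optional[int]:
    """Return end_at - start_at, or None if the job has not finished."""
    start, end = job.get('start_at'), job.get('end_at')
    if start is None or end is None:
        return None
    return end - start


# Sample records with only the fields used here.
sample_jobs: List[Dict[str, Any]] = [
    {'job_id': 1, 'status': 'SUCCEEDED',
     'start_at': 1700000000, 'end_at': 1700000120},
    {'job_id': 2, 'status': 'RUNNING',
     'start_at': 1700000200, 'end_at': None},
]
runtimes = {job['job_id']: job_runtime_seconds(job) for job in sample_jobs}
```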
sky.job_status#
- sky.job_status(cluster_name, job_ids=None)[source]
Gets the status of jobs on a cluster.
- Returns:
The request ID of the job status request.
- Request Returns:
job_statuses (Dict[Optional[int], Optional[job_lib.JobStatus]]) – A mapping of job_id to job statuses. The status will be None if the job does not exist. If job_ids is None and there is no job on the cluster, it will return {None: None}.
- Request Raises:
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
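The `job_statuses` mapping has two special cases worth handling explicitly: a `None` status for a job that does not exist, and `{None: None}` when no job IDs were given and the cluster has no jobs. A minimal sketch, using hypothetical helper names and plain strings in place of `job_lib.JobStatus` values:

```python
from typing import Dict, List, Optional


def cluster_has_no_jobs(job_statuses: Dict[Optional[int], Optional[str]]) -> bool:
    """True for the documented {None: None} result."""
    return job_statuses == {None: None}


def missing_job_ids(job_statuses: Dict[Optional[int], Optional[str]]) -> List[int]:
    """Job IDs that were queried but do not exist on the cluster."""
    return sorted(job_id for job_id, status in job_statuses.items()
                  if job_id is not None and status is None)


empty_result = cluster_has_no_jobs({None: None})
missing = missing_job_ids({1: 'SUCCEEDED', 7: None, 4: None})
```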
sky.tail_logs#
- sky.tail_logs(cluster_name, job_id, follow, tail=0, output_stream=None)[source]
Tails the logs of a job.
- Parameters:
cluster_name (str) – name of the cluster.
follow (bool) – if True, follow the logs. Otherwise, return the logs immediately.
tail (int) – if > 0, tail the last N lines of the logs.
output_stream (Optional[TextIOBase]) – the stream to write the logs to. If None, print to the console.
- Returns:
Exit code based on success or failure of the job. 0 if success, 100 if the job failed. See exceptions.JobExitCode for possible exit codes.
- Request Raises:
ValueError – if arguments are invalid or the cluster is not supported.
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
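Because the returned value is an exit code, a caller may want to branch on it. The sketch below covers only the two values named above (0 and 100) and defers every other value to `exceptions.JobExitCode`; the helper name `describe_exit_code` is hypothetical.

```python
def describe_exit_code(code: int) -> str:
    """Translate the documented exit codes into a short description."""
    if code == 0:
        return 'job succeeded'
    if code == 100:
        return 'job failed'
    # Other values are not described here; consult exceptions.JobExitCode.
    return f'other exit code {code}; see exceptions.JobExitCode'


ok_message = describe_exit_code(0)
failed_message = describe_exit_code(100)
```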
sky.download_logs#
- sky.download_logs(cluster_name, job_ids)[source]
Downloads the logs of jobs.
- Returns:
The request ID of the download_logs request.
- Request Returns:
job_log_paths (Dict[str, str]) – a mapping of job_id to local log path.
- Request Raises:
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
sky.cancel#
- sky.cancel(cluster_name, all=False, all_users=False, job_ids=None, _try_cancel_if_cluster_is_init=False)[source]
Cancels jobs on a cluster.
- Parameters:
cluster_name (str) – name of the cluster.
all (bool) – if True, cancel all jobs.
all_users (bool) – if True, cancel all jobs from all users.
job_ids (Optional[List[int]]) – a list of job IDs to cancel.
_try_cancel_if_cluster_is_init (bool) – whether to try cancelling the job even if the cluster is not UP, but the head node is still alive. This is used by the jobs controller to cancel the job when the worker node is preempted in the spot cluster.
- Returns:
The request ID of the cancel request.
- Request Returns:
None
- Request Raises:
ValueError – if arguments are invalid.
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the specified cluster is a controller that does not support this operation.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
Managed jobs SDK#
sky.jobs.launch#
- sky.jobs.launch(task, name=None, _need_confirmation=False)[source]
Launches a managed job.
Please refer to sky.cli.job_launch for documentation.
- Returns:
The request ID of the launch request.
- Request Returns:
job_id (Optional[int]) – Job ID for the managed job
controller_handle (Optional[ResourceHandle]) – ResourceHandle of the controller
- Request Raises:
ValueError – if the cluster does not exist, or the entrypoint is not a valid chain DAG.
sky.exceptions.NotSupportedError – the feature is not supported.
sky.jobs.queue#
- sky.jobs.queue(refresh, skip_finished=False, all_users=False)[source]
Gets statuses of managed jobs.
Please refer to sky.cli.job_queue for documentation.
- Returns:
The request ID of the queue request.
- Request Returns:
job_records (List[Dict[str, Any]]) – A list of dicts, with each dict containing the information of a job.
    [
        {
            'job_id': (int) job id,
            'job_name': (str) job name,
            'resources': (str) resources of the job,
            'submitted_at': (float) timestamp of submission,
            'end_at': (float) timestamp of end,
            'duration': (float) duration in seconds,
            'recovery_count': (int) Number of retries,
            'status': (sky.jobs.ManagedJobStatus) of the job,
            'cluster_resources': (str) resources of the cluster,
            'region': (str) region of the cluster,
        }
    ]
- Request Raises:
sky.exceptions.ClusterNotUpError – the jobs controller is not up or does not exist.
RuntimeError – if failed to get the managed jobs with ssh.
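One way to use the managed job records is to find jobs that needed recovery. The sketch below reads only the `job_name` and `recovery_count` fields listed above; the helper name `recovered_jobs` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def recovered_jobs(job_records: List[Dict[str, Any]],
                   min_recoveries: int = 1) -> List[Tuple[str, int]]:
    """Return (job_name, recovery_count) for jobs recovered at least N times."""
    return [(record['job_name'], record['recovery_count'])
            for record in job_records
            if record['recovery_count'] >= min_recoveries]


# Sample records with only the fields used here.
sample_managed_jobs = [
    {'job_id': 1, 'job_name': 'pretrain', 'recovery_count': 3},
    {'job_id': 2, 'job_name': 'eval', 'recovery_count': 0},
    {'job_id': 3, 'job_name': 'finetune', 'recovery_count': 1},
]
recovered = recovered_jobs(sample_managed_jobs)
```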
sky.jobs.cancel#
- sky.jobs.cancel(name=None, job_ids=None, all=False, all_users=False)[source]
Cancels managed jobs.
Please refer to sky.cli.job_cancel for documentation.
- Returns:
The request ID of the cancel request.
- Request Raises:
sky.exceptions.ClusterNotUpError – the jobs controller is not up.
RuntimeError – failed to cancel the job.
sky.jobs.tail_logs#
- sky.jobs.tail_logs(name=None, job_id=None, follow=True, controller=False, refresh=False, output_stream=None)[source]
Tails logs of managed jobs.
You can provide either a job name or a job ID to tail logs. If neither is provided, the logs of the latest job will be shown.
- Parameters:
name (Optional[str]) – Name of the managed job to tail logs.
job_id (Optional[int]) – ID of the managed job to tail logs.
follow (bool) – Whether to follow the logs.
controller (bool) – Whether to tail logs from the jobs controller.
refresh (bool) – Whether to restart the jobs controller if it is stopped.
output_stream (Optional[TextIOBase]) – The stream to write the logs to. If None, print to the console.
- Returns:
Exit code based on success or failure of the job. 0 if success, 100 if the job failed. See exceptions.JobExitCode for possible exit codes.
- Request Raises:
ValueError – invalid arguments.
sky.exceptions.ClusterNotUpError – the jobs controller is not up.
Serving SDK#
sky.serve.up#
- sky.serve.up(task, service_name, _need_confirmation=False)[source]
Spins up a service.
Please refer to sky.cli.serve_up for documentation.
- Returns:
The request ID of the up request.
- Request Returns:
service_name (str) – The name of the service. Same as the argument, if one was passed in.
endpoint (str) – The service endpoint.
sky.serve.update#
- sky.serve.update(task, service_name, mode, _need_confirmation=False)[source]
Updates an existing service.
Please refer to sky.cli.serve_update for documentation.
- Returns:
The request ID of the update request.
- Request Returns:
None
sky.serve.down#
- sky.serve.down(service_names, all=False, purge=False)[source]
Tears down a service.
Please refer to sky.cli.serve_down for documentation.
- Returns:
The request ID of the down request.
- Request Returns:
None
- Request Raises:
sky.exceptions.ClusterNotUpError – if the sky serve controller is not up.
ValueError – if the arguments are invalid.
RuntimeError – if failed to terminate the service.
sky.serve.terminate_replica#
- sky.serve.terminate_replica(service_name, replica_id, purge)[source]
Tears down a specific replica for the given service.
- Parameters:
service_name (str) – Name of the service.
replica_id (int) – ID of replica to terminate.
purge (bool) – Whether to terminate replicas in a failed status. These replicas may lead to resource leaks, so we require the user to explicitly specify this flag to make sure they are aware of this potential resource leak.
- Returns:
The request ID of the terminate replica request.
- Request Raises:
sky.exceptions.ClusterNotUpError – if the sky serve controller is not up.
RuntimeError – if failed to terminate the replica.
sky.serve.status#
- sky.serve.status(service_names)[source]
Gets service statuses.
If service_names is given, return those services. Otherwise, return all services.
Each returned value has the following fields:
    {
        'name': (str) service name,
        'active_versions': (List[int]) a list of versions that are active,
        'controller_job_id': (int) the job id of the controller,
        'uptime': (int) uptime in seconds,
        'status': (sky.ServiceStatus) service status,
        'controller_port': (Optional[int]) controller port,
        'load_balancer_port': (Optional[int]) load balancer port,
        'endpoint': (Optional[str]) endpoint of the service,
        'policy': (Optional[str]) autoscaling policy description,
        'requested_resources_str': (str) str representation of requested resources,
        'load_balancing_policy': (str) load balancing policy name,
        'replica_info': (List[Dict[str, Any]]) replica information,
    }
Each entry in replica_info has the following fields:
    {
        'replica_id': (int) replica id,
        'name': (str) replica name,
        'status': (sky.serve.ReplicaStatus) replica status,
        'version': (int) replica version,
        'launched_at': (int) timestamp of launched,
        'handle': (ResourceHandle) handle of the replica cluster,
        'endpoint': (str) endpoint of the replica,
    }
For possible service statuses and replica statuses, please refer to sky.cli.serve_status.
- Parameters:
service_names (Union[List[str], str, None]) – a single service name or a list of service names to query. If None, query all services.
- Returns:
The request ID of the status request.
- Request Returns:
service_records (List[Dict[str, Any]]) – A list of dicts, with each dict containing the information of a service. If a service is not found, it will be omitted from the returned list.
- Request Raises:
RuntimeError – if failed to get the service status.
exceptions.ClusterNotUpError – if the sky serve controller is not up.
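To show how the nested `replica_info` entries can be consumed, the sketch below groups the replica IDs of one service record by status. The helper name `replicas_by_status` is hypothetical, and the status strings in the sample data are placeholders standing in for `sky.serve.ReplicaStatus` members.

```python
from typing import Any, Dict, List


def replicas_by_status(service_record: Dict[str, Any]) -> Dict[str, List[int]]:
    """Group the replica IDs of one service record by replica status."""
    grouped: Dict[str, List[int]] = {}
    for replica in service_record['replica_info']:
        # str() keeps this working for both enum members and plain strings.
        grouped.setdefault(str(replica['status']), []).append(
            replica['replica_id'])
    return grouped


# Sample record with only the fields used here.
sample_service = {
    'name': 'my-service',
    'replica_info': [
        {'replica_id': 1, 'status': 'READY'},
        {'replica_id': 2, 'status': 'STARTING'},
        {'replica_id': 3, 'status': 'READY'},
    ],
}
grouped_replicas = replicas_by_status(sample_service)
```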
sky.serve.tail_logs#
- sky.serve.tail_logs(service_name, target, replica_id=None, follow=True, output_stream=None)[source]
Tails logs for a service.
Usage:
    sky.serve.tail_logs(
        service_name,
        target=<component>,
        follow=False,  # Optional; defaults to True.
        # replica_id=3,  # Must be specified when target is REPLICA.
    )
target is an enum of sky.serve.ServiceComponent, which can be one of:
sky.serve.ServiceComponent.CONTROLLER
sky.serve.ServiceComponent.LOAD_BALANCER
sky.serve.ServiceComponent.REPLICA
Passing target as a lower-case string is also supported, e.g. target='controller'. To use sky.serve.ServiceComponent.REPLICA, you must specify replica_id.
To tail controller logs:
    # follow defaults to True.
    sky.serve.tail_logs(
        service_name, target=sky.serve.ServiceComponent.CONTROLLER
    )
To print replica 3 logs:
    # Passing target as a lower-case string is also supported.
    sky.serve.tail_logs(
        service_name, target='replica', follow=False,
        replica_id=3
    )
- Parameters:
service_name (str) – Name of the service.
target (Union[str, ServiceComponent]) – The component to tail logs.
replica_id (Optional[int]) – The ID of the replica to tail logs.
follow (bool) – Whether to follow the logs.
output_stream (Optional[TextIOBase]) – The stream to write the logs to. If None, print to the console.
- Returns:
The request ID of the tail logs request.
- Request Raises:
sky.exceptions.ClusterNotUpError – the sky serve controller is not up.
ValueError – arguments not valid, or failed to tail the logs.
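The rules for `target` and `replica_id` can be sketched as a small normalization step. This is an illustration, not the SDK's own checking: the `ServiceComponent` enum below is a local stand-in for `sky.serve.ServiceComponent`, and its string values are assumptions (only `'controller'` and `'replica'` appear in the examples above; `'load_balancer'` is a guess). The helper name `normalize_target` is hypothetical.

```python
import enum
from typing import Optional, Union


class ServiceComponent(enum.Enum):
    """Local stand-in for sky.serve.ServiceComponent (values assumed)."""
    CONTROLLER = 'controller'
    LOAD_BALANCER = 'load_balancer'
    REPLICA = 'replica'


def normalize_target(target: Union[str, ServiceComponent],
                     replica_id: Optional[int] = None) -> ServiceComponent:
    """Accept an enum member or lower-case string; enforce the replica rule."""
    if isinstance(target, str):
        # Enum lookup by value raises ValueError for unknown names.
        target = ServiceComponent(target)
    if target is ServiceComponent.REPLICA and replica_id is None:
        raise ValueError('replica_id must be specified when target is REPLICA.')
    return target


controller_target = normalize_target('controller')
replica_target = normalize_target(ServiceComponent.REPLICA, replica_id=3)
```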
Task#
- class sky.Task(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None, event_callback=None, blocked_resources=None, file_mounts_mapping=None)[source]#
Task: a computation to be run on the cloud.
- __init__(name=None, *, setup=None, run=None, envs=None, workdir=None, num_nodes=None, docker_image=None, event_callback=None, blocked_resources=None, file_mounts_mapping=None)[source]#
Initializes a Task.
All fields are optional. Task.run is the actual program: either a shell command to run (str) or a command generator for different nodes (lambda; see below).
Optionally, call Task.set_resources() to set the resource requirements for this task. If not set, a default CPU-only requirement is assumed (the same as sky launch).
All setters of this class, Task.set_*(), return self, i.e., they are fluent APIs and can be chained together.
Example
    # A Task that will sync up local workdir '.', containing
    # requirements.txt and train.py.
    sky.Task(setup='pip install -r requirements.txt',
             run='python train.py',
             workdir='.')

    # An empty Task for provisioning a cluster.
    task = sky.Task(num_nodes=n).set_resources(...)

    # Chaining setters.
    sky.Task().set_resources(...).set_file_mounts(...)
- Parameters:
name (Optional[str]) – A string name for the Task for display purposes.
setup (Optional[str]) – A setup command, which is run before the run commands (run) and executed under workdir.
run (Union[str, Callable[[int, List[str]], Optional[str]], None]) – The actual command for the task. If not None, either a shell command (str) or a command generator (callable). If the latter, it must take a node rank and a list of node addresses as input and return a shell command (str); returning None for some nodes is valid, in which case no commands are run on them. Run commands are run under workdir. Note that the command generator should be a self-contained lambda.
envs (Optional[Dict[str, str]]) – A dictionary of environment variables to set before running the setup and run commands.
workdir (Optional[str]) – The local working directory. This directory will be synced to a location on the remote VM(s), and setup and run commands will be run under that location (thus, they can rely on relative paths when invoking binaries).
num_nodes (Optional[int]) – The number of nodes to provision for this Task. If None, treated as 1 node. If > 1, each node will execute its own setup/run command, where run can either be a str, meaning all nodes get the same command, or a lambda, with the semantics documented above.
docker_image (Optional[str]) – (EXPERIMENTAL: Only in effect when LocalDockerBackend is used.) The base docker image that this Task will be built on. Defaults to 'gpuci/miniforge-cuda:11.4-devel-ubuntu18.04'.
blocked_resources (Optional[Iterable[Resources]]) – A set of resources that this task cannot run on.
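The command-generator form of run described above can be sketched as a plain function. This is a minimal illustration, not SkyPilot code: the function name, the node addresses, and the `--nodes` flag of train.py are all hypothetical, and the function is called directly here rather than through a cluster.

```python
from typing import List, Optional


def head_only_command(node_rank: int, node_ips: List[str]) -> Optional[str]:
    """Command generator: returns the shell command for a single node."""
    if node_rank == 0:
        # Rank 0 is told about every node address (hypothetical --nodes flag).
        return 'python train.py --nodes ' + ','.join(node_ips)
    # Returning None means no command is run on this node.
    return None


# Simulate what each of two nodes would receive (addresses are made up).
commands = [
    head_only_command(rank, ['10.0.0.1', '10.0.0.2']) for rank in range(2)
]
```

Under these assumptions, such a function would be passed as `sky.Task(run=head_only_command, num_nodes=2)`; it refers only to its own arguments, in keeping with the note that the generator should be self-contained.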
- static from_yaml(yaml_path)[source]#
Initializes a task from a task YAML.
Example
task = sky.Task.from_yaml('/path/to/task.yaml')
- Parameters:
yaml_path (str) – file path to a valid task yaml file.
- Raises:
ValueError – if the path gets loaded into a str instead of a dict; or if there are any other parsing errors.
- Return type:
- set_resources(resources)[source]#
Sets the required resources to execute this task.
If this function is not called for a Task, default resource requirements will be used (8 vCPUs).
- Parameters:
resources (Union[Resources, List[Resources], Set[Resources]]) – either a sky.Resources, a set of them, or a list of them. A set or a list of resources asks the optimizer to “pick the best of these resources” to run this task.
- Return type:
- Returns:
self – The current task, with resources set.
- set_resources_override(override_params)[source]#
Sets the override parameters for the resources.
- Return type:
- set_file_mounts(file_mounts)[source]#
Sets the file mounts for this task.
Useful for syncing datasets, dotfiles, etc.
File mounts are a dictionary: {remote_path: local_path/cloud URI}. Local (or cloud) files/directories will be synced to the specified paths on the remote VM(s) where this Task will run.
Neither source nor destination paths can end with a slash.
Example
    task.set_file_mounts({
        '~/.dotfile': '/local/.dotfile',
        # /remote/dir/ will contain the contents of /local/dir/.
        '/remote/dir': '/local/dir',
    })
- Parameters:
file_mounts (Optional[Dict[str, str]]) – an optional dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
- Return type:
- Returns:
self – the current task, with file mounts set.
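The trailing-slash rule for set_file_mounts() can be checked before the dict is handed to a task. The helper below is a hypothetical pre-check written for illustration; it is not SkyPilot's own validation and only covers the one rule stated above.

```python
from typing import Dict, List


def paths_with_trailing_slash(file_mounts: Dict[str, str]) -> List[str]:
    """Returns every remote path or source that ends with a slash."""
    offending = []
    for remote_path, source in file_mounts.items():
        for path in (remote_path, source):
            if path.endswith('/'):
                offending.append(path)
    return offending


ok = paths_with_trailing_slash({'/remote/dir': '/local/dir'})
bad = paths_with_trailing_slash({'/remote/dir/': '/local/dir'})
```

An empty result means the mapping satisfies the rule; anything else lists the paths to fix.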
- update_file_mounts(file_mounts)[source]#
Updates the file mounts for this task.
Different from set_file_mounts(), this function updates the existing file_mounts (calls dict.update()), rather than overwriting it.
This should be called before provisioning in order to take effect.
Example
    task.update_file_mounts({
        '~/.config': '~/Documents/config',
        '/tmp/workdir': '/local/workdir/cnn-cifar10',
    })
- Parameters:
file_mounts (Dict[str, str]) – a dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
- Return type:
- Returns:
self – the current task, with file mounts updated.
- Raises:
ValueError – if input paths are invalid.
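The difference between setting and updating file mounts comes down to replacing a dict versus merging into it. The sketch below uses plain dicts to show both outcomes; it does not call SkyPilot, and the paths are illustrative.

```python
existing = {'~/.dotfile': '/local/.dotfile'}
new = {'/remote/dir': '/local/dir'}

# set-style: the new mapping replaces whatever was there before.
after_set = dict(new)

# update-style: merged with dict.update(); keys in `new` win on conflict.
after_update = dict(existing)
after_update.update(new)
```

After the update-style merge both mounts are present, whereas the set-style result contains only the new entry.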
- set_storage_mounts(storage_mounts)[source]#
Sets the storage mounts for this task.
Storage mounts are a dictionary: {mount_path: sky.Storage object}, each of which mounts a sky.Storage object (a cloud object store bucket) to a path inside the remote cluster.
A sky.Storage object can be created by uploading from a local directory (setting source), or backed by an existing cloud bucket (setting name to the bucket name; or setting source to the bucket URI).
Example
    task.set_storage_mounts({
        '/remote/imagenet/': sky.Storage(name='my-bucket',
                                         source='/local/imagenet'),
    })
- Parameters:
storage_mounts (Optional[Dict[str, Storage]]) – an optional dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
- Return type:
- Returns:
self – The current task, with storage mounts set.
- Raises:
ValueError – if input paths are invalid.
- update_storage_mounts(storage_mounts)[source]#
Updates the storage mounts for this task.
Different from set_storage_mounts(), this function updates the existing storage_mounts (calls dict.update()), rather than overwriting it.
This should be called before provisioning in order to take effect.
- Parameters:
storage_mounts (Dict[str, Storage]) – a dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
- Return type:
- Returns:
self – The current task, with storage mounts updated.
- Raises:
ValueError – if input paths are invalid.
Resources#
- class sky.Resources(cloud=None, instance_type=None, cpus=None, memory=None, accelerators=None, accelerator_args=None, use_spot=None, job_recovery=None, region=None, zone=None, image_id=None, disk_size=None, disk_tier=None, ports=None, labels=None, _docker_login_config=None, _docker_username_for_runpod=None, _is_image_managed=None, _requires_fuse=None, _cluster_config_overrides=None)[source]#
Resources: compute requirements of Tasks.
This class is immutable once created (to ensure some validations are done whenever properties change). To update the property of an instance of Resources, use resources.copy(**new_properties).
Used:
for representing resource requests for tasks/apps
as a “filter” to get concrete launchable instances
for calculating billing
for provisioning on a cloud
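The immutable-plus-copy pattern described for Resources can be illustrated with a frozen dataclass. This is an analogy under stated assumptions, not the actual implementation: `ResourceSpec` and its two fields are hypothetical stand-ins, and `dataclasses.replace` is swapped in for whatever the real copy() does internally.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical stand-in for an immutable resource request."""
    accelerators: Optional[str] = None
    use_spot: bool = False

    def copy(self, **new_properties) -> 'ResourceSpec':
        # Builds a new object, so any construction-time validation
        # would run again on the changed properties.
        return dataclasses.replace(self, **new_properties)


base = ResourceSpec(accelerators='V100:1')
spot = base.copy(use_spot=True)  # `base` itself is left unchanged.
```

Because every change goes through construction of a new object, there is a single place where validation can be enforced.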
- __init__(cloud=None, instance_type=None, cpus=None, memory=None, accelerators=None, accelerator_args=None, use_spot=None, job_recovery=None, region=None, zone=None, image_id=None, disk_size=None, disk_tier=None, ports=None, labels=None, _docker_login_config=None, _docker_username_for_runpod=None, _is_image_managed=None, _requires_fuse=None, _cluster_config_overrides=None)[source]#
Initialize a Resources object.
All fields are optional.
Resources.is_launchable decides whether the Resources is fully specified to launch an instance.
Examples
    import sky
    from sky import clouds

    # Fully specified cloud and instance type (is_launchable() is True).
    sky.Resources(clouds.AWS(), 'p3.2xlarge')
    sky.Resources(clouds.GCP(), 'n1-standard-16')
    # Accelerators are passed by keyword; the third positional
    # parameter is cpus.
    sky.Resources(clouds.GCP(), 'n1-standard-8', accelerators='V100')

    # Specifying required resources; the system decides the
    # cloud/instance type. The three below are equivalent:
    sky.Resources(accelerators='V100')
    sky.Resources(accelerators='V100:1')
    sky.Resources(accelerators={'V100': 1})

    # Adding lower bounds on CPUs and memory.
    sky.Resources(cpus='2+', memory='16+', accelerators='V100')
- Parameters:
cloud (Optional[Cloud]) – the cloud to use.
cpus (Union[None, int, float, str]) – the number of CPUs required for the task. If a str, must be a string of the form '2' or '2+', where the + indicates that the task requires at least 2 CPUs.
memory (Union[None, int, float, str]) – the amount of memory in GiB required. If a str, must be a string of the form '16' or '16+', where the + indicates that the task requires at least 16 GiB of memory.
accelerators (Union[None, str, Dict[str, int]]) – the accelerators required. If a str, must be a string of the form 'V100' or 'V100:2', where the :2 indicates that the task requires 2 V100 GPUs. If a dict, must be a dict of the form {'V100': 2} or {'tpu-v2-8': 1}.
accelerator_args (Optional[Dict[str, str]]) – accelerator-specific arguments. For example, {'tpu_vm': True, 'runtime_version': 'tpu-vm-base'} for TPUs.
use_spot (Optional[bool]) – whether to use spot instances. If None, defaults to False.
job_recovery (Union[Dict[str, Union[str, int]], str, None]) – the job recovery strategy to use for the managed job to recover the cluster from preemption. Refer to the recovery_strategy module for more details. When a dict is provided, it can have the following fields:
strategy: the recovery strategy to use.
max_restarts_on_errors: the max number of restarts on user code errors.
image_id (Union[Dict[str, str], str, None]) – the image ID to use. If a str, must be a string of the image ID from the cloud, such as AWS: 'ami-1234567890abcdef0', GCP: 'projects/my-project-id/global/images/my-image-name'; or an image tag provided by SkyPilot, such as AWS: 'skypilot:gpu-ubuntu-2004'. If a dict, must be a dict mapping from region to image ID, such as: {'us-west1': 'ami-1234567890abcdef0', 'us-east1': 'ami-1234567890abcdef0'}
disk_tier (Union[str, DiskTier, None]) – the disk performance tier to use. If None, defaults to 'medium'.
ports (Union[int, str, List[str], Tuple[str], None]) – the ports to open on the instance.
labels (Optional[Dict[str, str]]) – the labels to apply to the instance. These are useful for assigning metadata that may be used by external tools. The implementation depends on the chosen cloud: on AWS, labels map to instance tags; on GCP, labels map to instance labels; on Kubernetes, labels map to pod labels. On other clouds, labels are not supported and will be ignored.
_docker_login_config (Optional[DockerLoginConfig]) – the docker configuration to use. This includes the docker username, password, and registry server. If None, skip docker login.
_docker_username_for_runpod (Optional[str]) – the login username for the docker containers. This is used by RunPod to set the ssh user for the docker containers.
_requires_fuse (Optional[bool]) – whether the task requires FUSE mounting support. This is used internally by certain cloud implementations to do additional setup for FUSE mounting. This flag also safeguards against using FUSE mounting on existing clusters that do not support it. If None, defaults to False.
- Raises:
ValueError – if some attributes are invalid.
exceptions.NoCloudAccessError – if no public cloud is enabled.
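The string forms accepted by accelerators, cpus, and memory follow two small conventions: an optional `:count` suffix and an optional trailing `+`. The helpers below are a hypothetical sketch of how those forms can be read; they are not SkyPilot's parser and ignore any validation the library performs.

```python
from typing import Dict, Tuple, Union


def normalize_accelerators(spec: Union[str, Dict[str, int]]) -> Dict[str, int]:
    """Turns 'V100', 'V100:2', or {'V100': 2} into a {name: count} dict."""
    if isinstance(spec, dict):
        return dict(spec)
    name, sep, count = spec.partition(':')
    # Without an explicit ':count', a single accelerator is assumed.
    return {name: int(count) if sep else 1}


def parse_at_least(spec: str) -> Tuple[float, bool]:
    """Splits '2' or '2+' into (amount, is_lower_bound)."""
    return float(spec.rstrip('+')), spec.endswith('+')


same = [normalize_accelerators(s) for s in ('V100', 'V100:1', {'V100': 1})]
cpus = parse_at_least('2+')
```

The three accelerator forms normalize to the same dict, matching the equivalence stated in the Examples for Resources.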
Enums#
- class sky.ClusterStatus(value)[source]#
Cluster status as recorded in local cache.
This can be different from the actual cluster status, and can be refreshed by running sky status --refresh.
- INIT = 'INIT'#
Initializing.
This means provisioning has started but has not successfully finished. The cluster may be undergoing setup, may have failed setup, or may be live or down.
- UP = 'UP'#
The cluster is up. This means provisioning has previously succeeded.
- STOPPED = 'STOPPED'#
The cluster is stopped.
- class sky.JobStatus(value)[source]#
Job status enum.
- INIT = 'INIT'#
The job has been submitted, but not started yet.
- PENDING = 'PENDING'#
The job is waiting for required resources.
- SETTING_UP = 'SETTING_UP'#
The job is running the user’s setup script.
- RUNNING = 'RUNNING'#
The job is running.
- FAILED_DRIVER = 'FAILED_DRIVER'#
The job driver process failed.
- SUCCEEDED = 'SUCCEEDED'#
The job finished successfully.
- FAILED = 'FAILED'#
The job failed due to the user code.
- FAILED_SETUP = 'FAILED_SETUP'#
The job setup failed.
- CANCELLED = 'CANCELLED'#
The job is cancelled by the user.
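When polling a job, it helps to know which statuses mean the job will not change state again. The sketch below mirrors the values listed above in a local enum instead of importing from sky. The choice of terminal statuses is an assumption inferred from the descriptions, and `is_finished` is a hypothetical helper.

```python
import enum


class JobStatus(enum.Enum):
    """Local mirror of the documented values (not imported from sky)."""
    INIT = 'INIT'
    PENDING = 'PENDING'
    SETTING_UP = 'SETTING_UP'
    RUNNING = 'RUNNING'
    FAILED_DRIVER = 'FAILED_DRIVER'
    SUCCEEDED = 'SUCCEEDED'
    FAILED = 'FAILED'
    FAILED_SETUP = 'FAILED_SETUP'
    CANCELLED = 'CANCELLED'


# Assumption: these statuses are final and the job will not move on.
TERMINAL_STATUSES = frozenset({
    JobStatus.FAILED_DRIVER,
    JobStatus.SUCCEEDED,
    JobStatus.FAILED,
    JobStatus.FAILED_SETUP,
    JobStatus.CANCELLED,
})


def is_finished(status: JobStatus) -> bool:
    """True if the job has reached a final state under the assumption above."""
    return status in TERMINAL_STATUSES
```

A polling loop would stop once `is_finished(JobStatus(value))` is true for the reported status string.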
API server SDK#
sky.get
#
- sky.get(request_id)[source]
Waits for and gets the result of a request.
- Parameters:
request_id (str) – The request ID of the request to get.
- Return type:
- Returns:
The Request Returns of the specified request. See the documentation of the specific requests above for more details.
- Raises:
Exception – It raises the same exceptions as the specific requests; see Request Raises in the documentation of the specific requests above.
sky.stream_and_get
#
- sky.stream_and_get(request_id=None, log_path=None, tail=None, follow=True, output_stream=None)[source]
Streams the logs of a request or a log file and gets the final result.
This will block until the request is finished. The request id can be a prefix of the full request id.
- Parameters:
request_id (Optional[str]) – The prefix of the request ID of the request to stream.
log_path (Optional[str]) – The path to the log file to stream.
tail (Optional[int]) – The number of lines to show from the end of the logs. If None, show all logs.
follow (bool) – Whether to follow the logs.
output_stream (Optional[TextIOBase]) – The output stream to write to. If None, print to the console.
- Return type:
- Returns:
The Request Returns of the specified request. See the documentation of the specific requests above for more details.
- Raises:
Exception – It raises the same exceptions as the specific requests; see Request Raises in the documentation of the specific requests above.
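Since SDK calls return a request ID, a small wrapper can decide whether to block quietly with get() or block while streaming logs with stream_and_get(). The sketch below is a hypothetical helper that takes the SDK as an argument, so it can be run against a stand-in object without an API server; `FakeSdk` and its return values are invented for illustration.

```python
from typing import Any


def wait_for_request(sdk: Any, request_id: str, stream: bool = False) -> Any:
    """Blocks until the request finishes and returns its result."""
    if stream:
        # Streams the request's logs while waiting.
        return sdk.stream_and_get(request_id)
    return sdk.get(request_id)


class FakeSdk:
    """Hypothetical stand-in exposing the two calls used above."""

    def get(self, request_id: str) -> str:
        return f'result-of-{request_id}'

    def stream_and_get(self, request_id: str) -> str:
        print(f'(logs for {request_id})')
        return f'result-of-{request_id}'


result = wait_for_request(FakeSdk(), 'abc123')
```

In real use the sky module itself would be passed in, for example `wait_for_request(sky, sky.launch(task), stream=True)`; any exception raised by the underlying request propagates to the caller.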
sky.api_status
#
sky.api_cancel
#
- sky.api_cancel(request_ids=None, all_users=False, silent=False)[source]
Aborts a request or all requests.
- Parameters:
- Return type:
- Returns:
The request ID of the abort request itself.
- Request Returns:
A list of request IDs that were cancelled.
- Raises:
click.BadParameter – If no request ID is specified and neither all nor all_users is set.
sky.api_info
#
sky.api_start
#
- sky.api_start(*, deploy=False, host='127.0.0.1', foreground=False)[source]
Starts the API server.
It checks the existence of the API server and starts it if it does not exist.
- Parameters:
deploy (bool) – Whether to deploy the API server, i.e. fully utilize the resources of the machine.
host (str) – The host to deploy the API server. It will be set to 0.0.0.0 if deploy is True, to allow remote access.
foreground (bool) – Whether to run the API server in the foreground (run in the current process).
- Return type:
- Returns:
None