Python SDK#

SkyPilot offers a Python SDK, which is used under the hood by the CLI.

Most SDK calls are asynchronous and return a future (request ID).

To wait and get the results:

sky.get(request_id): Wait for a request to finish, and get the results or exceptions.
sky.stream_and_get(request_id): Stream the logs of a request, and get the results or exceptions.

To manage asynchronous requests:

sky.api_status(): List all requests and their statuses.
sky.api_cancel(request_id): Cancel a request.

Refer to the Request Returns and Request Raises sections of each API for more details.
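The request/future pattern above can be wrapped in a small helper. This is a minimal sketch, not part of SkyPilot: submit_and_wait is a hypothetical name, and sdk stands for the imported sky module, passed in as an argument so that the helper has no import-time dependency. It relies only on sky.get and sky.stream_and_get as described above.

```python
from typing import Any, Callable


def submit_and_wait(sdk: Any, call: Callable[..., Any], *args: Any,
                    stream: bool = False, **kwargs: Any) -> Any:
    """Submit an asynchronous SDK call and block until its result is ready.

    `sdk` is assumed to be the imported `sky` module; `call` is any SDK
    function that returns a request ID (for example `sky.status`).
    """
    # The SDK call returns immediately with a request ID (a future).
    request_id = call(*args, **kwargs)
    # Either stream the request's logs while waiting, or wait silently.
    # Both return the request's result or raise the request's exception.
    if stream:
        return sdk.stream_and_get(request_id)
    return sdk.get(request_id)
```

With SkyPilot installed, submit_and_wait(sky, sky.status) would return the list of cluster records, and passing stream=True would stream the request's logs while waiting.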

Note

Upgrading from v0.8 or older: If you upgraded from a version equal to or older than 0.8.0 to any newer version, you need to update your program to adapt to the new asynchronous execution model. See the migration guide for more details.

Clusters SDK#

`sky.launch`#

sky.launch(task, cluster_name=None, retry_until_up=False, idle_minutes_to_autostop=None, wait_for=None, dryrun=False, down=False, backend=None, optimize_target=OptimizeTarget.COST, no_setup=False, clone_disk_from=None, fast=False, _need_confirmation=False, _is_launched_by_jobs_controller=False, _is_launched_by_sky_serve_controller=False, _disable_controller_check=False)[source]

Launches a cluster or task.

The task’s setup and run commands are executed under the task’s workdir (when specified, it is synced to remote cluster). The task undergoes job queue scheduling on the cluster.

Currently, the first argument must be a sky.Task, or (EXPERIMENTAL advanced usage) a sky.Dag. In the latter case, it must currently contain a single task; support for pipelines/general DAGs is in experimental branches.

Example

import sky
task = sky.Task(run='echo hello SkyPilot')
task.set_resources(
    sky.Resources(infra='aws', accelerators='V100:4'))
sky.launch(task, cluster_name='my-cluster')

Parameters:

task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) to launch.
cluster_name (Optional[str]) – name of the cluster to create/reuse. If None, auto-generate a name.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.launch(...) and then sky.autostop(idle_minutes=<minutes>). If set, the autostop config specified in the task’s resources will be overridden by this parameter.
wait_for (Optional[AutostopWaitFor]) –
determines the condition for resetting the idleness timer. This option works in conjunction with idle_minutes_to_autostop. Choices:
1. "jobs_and_ssh" (default) - Wait for in-progress jobs and SSH connections to finish.
2. "jobs" - Only wait for in-progress jobs.
3. "none" - Wait for nothing; autostop right after idle_minutes_to_autostop.
dryrun (bool) – if True, do not actually launch the cluster.
down (bool) – Tear down the cluster after all jobs finish (successfully or abnormally). If idle_minutes_to_autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down, so that it can be debugged. If set, the autostop config specified in the task’s resources will be overridden by this parameter.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVmRayBackend).
optimize_target (OptimizeTarget) – target to optimize for. Choices: OptimizeTarget.COST, OptimizeTarget.TIME.
no_setup (bool) – if True, do not re-run setup commands.
clone_disk_from (Optional[str]) – [Experimental] if set, clone the disk from the specified cluster. This is useful to migrate the cluster to a different availability zone or region.
fast (bool) – [Experimental] If the cluster is already up and available, skip provisioning and setup steps.
_need_confirmation (bool) – (Internal only) If True, show the confirmation prompt.

Return type:

RequestId[Tuple[Optional[int], Optional[ResourceHandle]]]

Returns:

The request ID of the launch request.

Request Returns:

job_id (Optional[int]) – the job ID of the submitted job. None if the backend is not CloudVmRayBackend, or no job is submitted to the cluster.
handle (Optional[backends.ResourceHandle]) – the handle to the cluster. None if dryrun.

Request Raises:

exceptions.ClusterOwnerIdentityMismatchError – if the cluster is owned by another user.
exceptions.InvalidClusterNameError – if the cluster name is invalid.
exceptions.ResourcesMismatchError – if the requested resources do not match the existing cluster.
exceptions.NotSupportedError – if required features are not supported by the backend/cloud/cluster.
exceptions.ResourcesUnavailableError – if the requested resources cannot be satisfied. The failover_history of the exception will be set as:
1. Empty: iff the first-ever sky.optimize() fails to find a feasible resource; no pre-check or actual launch is attempted.
2. Non-empty: iff at least 1 exception from either our pre-checks (e.g., cluster name invalid) or a region/zone throwing resource unavailability.
exceptions.CommandError – any ssh command error.
exceptions.NoCloudAccessError – if all clouds are disabled.

Other exceptions may be raised depending on the backend.
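The two failover_history cases listed under exceptions.ResourcesUnavailableError can be told apart with a small helper. This is a minimal sketch: explain_unavailable is a hypothetical name, and it assumes only that failover_history is a sequence of exceptions, as described above.

```python
from typing import Any, Sequence


def explain_unavailable(failover_history: Sequence[Any]) -> str:
    """Summarize why a launch raised ResourcesUnavailableError."""
    if not failover_history:
        # Case 1: the optimizer found no feasible resources, so neither
        # pre-checks nor an actual launch were attempted.
        return 'no feasible resources found; no launch was attempted'
    # Case 2: at least one pre-check or region/zone attempt failed.
    last_error = failover_history[-1]
    return (f'{len(failover_history)} failed attempt(s); '
            f'last error: {last_error!r}')
```

In a real program this would typically be called as explain_unavailable(e.failover_history) inside an except block that catches the exception raised while waiting on the launch request.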

`sky.stop`#

sky.stop(cluster_name, purge=False)[source]

Stops a cluster.

Data on attached disks is not lost when a cluster is stopped. Billing for the instances will stop, while the disks will still be charged. Those disks will be reattached when restarting the cluster.

Currently, spot instance clusters cannot be stopped (except for GCP, which does allow disk contents to be preserved when stopping spot VMs).

Parameters:

cluster_name (str) – name of the cluster to stop.
purge (bool) – (Advanced) Forcefully mark the cluster as stopped in SkyPilot’s cluster table, even if the actual cluster stop operation failed on the cloud. WARNING: This flag should only be set sparingly in certain manual troubleshooting scenarios; with it set, it is the user’s responsibility to ensure there are no leaked instances and related resources.

Return type:

RequestId[None]

Returns:

The request ID of the stop request.

Request Returns:

None

Request Raises:

sky.exceptions.ClusterDoesNotExist – the specified cluster does not exist.
RuntimeError – failed to stop the cluster.
sky.exceptions.NotSupportedError – if the specified cluster is a spot cluster, or a TPU VM Pod cluster, or the managed jobs controller.

`sky.start`#

sky.start(cluster_name, idle_minutes_to_autostop=None, wait_for=None, retry_until_up=False, down=False, force=False)[source]

Restarts a cluster.

If a cluster was previously stopped (status is STOPPED) or failed in provisioning/runtime installation (status is INIT), this function will attempt to start the cluster. In the latter case, provisioning and runtime installation will be retried.

Auto-failover provisioning is not used when restarting a stopped cluster. It will be started on the same cloud, region, and zone that were chosen before.

If a cluster is already in the UP status, this function has no effect.

Parameters:

cluster_name (str) – name of the cluster to start.
idle_minutes_to_autostop (Optional[int]) – automatically stop the cluster after this many minutes of idleness, i.e., no running or pending jobs in the cluster’s job queue. Idleness gets reset whenever setting-up/running/pending jobs are found in the job queue. Setting this flag is equivalent to running sky.start(...) and then sky.autostop(idle_minutes=<minutes>). If not set, the cluster will not be autostopped.
wait_for (Optional[AutostopWaitFor]) –
determines the condition for resetting the idleness timer. This option works in conjunction with idle_minutes_to_autostop. Choices:
1. "jobs_and_ssh" (default) - Wait for in-progress jobs and SSH connections to finish.
2. "jobs" - Only wait for in-progress jobs.
3. "none" - Wait for nothing; autostop right after idle_minutes_to_autostop.
retry_until_up (bool) – whether to retry launching the cluster until it is up.
down (bool) – Autodown the cluster: tear down the cluster after specified minutes of idle time after all jobs finish (successfully or abnormally). Requires idle_minutes_to_autostop to be set.
force (bool) – whether to force start the cluster even if it is already up. Useful for upgrading SkyPilot runtime.

Return type:

RequestId[CloudVmRayResourceHandle]

Returns:

The request ID of the start request.

Request Returns:

None

Request Raises:

ValueError – argument values are invalid: (1) if down is set to True but idle_minutes_to_autostop is None; (2) if the specified cluster is the managed jobs controller, and either idle_minutes_to_autostop is not None or down is True (omit them to use the default autostop settings).
sky.exceptions.ClusterDoesNotExist – the specified cluster does not exist.
sky.exceptions.NotSupportedError – if the cluster to restart was launched using a non-default backend that does not support this operation.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the cluster to restart was launched by a different user.

`sky.down`#

sky.down(cluster_name, purge=False)[source]

Tears down a cluster.

Tearing down a cluster will delete all associated resources (all billing stops), and any data on the attached disks will be lost. Accelerators (e.g., TPUs) that are part of the cluster will be deleted too.

Parameters:

cluster_name (str) – name of the cluster to tear down.
purge (bool) – (Advanced) Forcefully remove the cluster from SkyPilot’s cluster table, even if the actual cluster termination failed on the cloud. WARNING: This flag should only be set sparingly in certain manual troubleshooting scenarios; with it set, it is the user’s responsibility to ensure there are no leaked instances and related resources.

Return type:

RequestId[None]

Returns:

The request ID of the down request.

Request Returns:

None

Request Raises:

sky.exceptions.ClusterDoesNotExist – the specified cluster does not exist.
RuntimeError – failed to tear down the cluster.
sky.exceptions.NotSupportedError – the specified cluster is the managed jobs controller.

`sky.status`#

sky.status(cluster_names=None, refresh=StatusRefreshMode.NONE, all_users=False)[source]

Gets cluster statuses.

If cluster_names is given, return those clusters. Otherwise, return all clusters.

Each cluster can have one of the following statuses:

INIT: The cluster may be live or down. It can happen in the following cases:
- Ongoing provisioning or runtime setup. (A sky.launch() has started but has not completed.)
- Or, the cluster is in an abnormal state, e.g., some cluster nodes are down, or the SkyPilot runtime is unhealthy. (To recover the cluster, try sky launch again on it.)
UP: Provisioning and runtime setup have succeeded and the cluster is live. (The most recent sky.launch() has completed successfully.)
STOPPED: The cluster is stopped and the storage is persisted. Use sky.start() to restart the cluster.

Autostop column:

The autostop column shows the number of minutes of idleness (no jobs running) after which the cluster will be autostopped. If to_down is True, the cluster will be autodowned rather than autostopped.

Getting up-to-date cluster statuses:

In normal cases where clusters are entirely managed by SkyPilot (i.e., no manual operations in cloud consoles) and no autostopping is used, the table returned by this command will accurately reflect the cluster statuses.
In cases where the clusters are changed outside of SkyPilot (e.g., manual operations in cloud consoles; unmanaged spot clusters getting preempted) or for autostop-enabled clusters, set refresh to a value other than the default StatusRefreshMode.NONE to query the latest cluster statuses from the cloud providers.

Parameters:

cluster_names (Optional[List[str]]) – a list of cluster names to query. If not provided, all clusters will be queried.
refresh (StatusRefreshMode) – whether to query the latest cluster statuses from the cloud provider(s).
all_users (bool) – whether to include all users’ clusters. By default, only the current user’s clusters are included.

Return type:

RequestId[List[Dict[str, Any]]]

Returns:

The request ID of the status request.

Request Returns:

cluster_records (List[Dict[str, Any]]) – A list of dicts, with each dict containing the information of a cluster. If a cluster is found to be terminated or not found, it will be omitted from the returned list.

{
  'name': (str) cluster name,
  'launched_at': (int) timestamp of last launch on this cluster,
  'handle': (ResourceHandle) an internal handle to the cluster,
  'last_use': (str) the last command/entrypoint that affected this
    cluster,
  'status': (sky.ClusterStatus) cluster status,
  'autostop': (int) idle time before autostop,
  'to_down': (bool) whether autodown is used instead of autostop,
  'metadata': (dict) metadata of the cluster,
  'user_hash': (str) user hash of the cluster owner,
  'user_name': (str) user name of the cluster owner,
  'resources_str': (str) the resource string representation of the
    cluster,
}
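The cluster records returned by the status request are plain dicts, so they can be summarized with ordinary Python. The sketch below is illustrative only: summarize_clusters is a hypothetical helper, it assumes the status value exposes a name attribute (as a standard Enum member would), and it assumes that a negative autostop value means no autostop is set.

```python
from typing import Any, Dict, List


def summarize_clusters(cluster_records: List[Dict[str, Any]]) -> List[str]:
    """Build one human-readable line per cluster record."""
    lines = []
    for record in cluster_records:
        status = record['status']
        # Enum members expose `.name`; fall back to str() otherwise.
        status_name = getattr(status, 'name', str(status))
        idle_minutes = record['autostop']
        # Assumption: a negative (or missing) value means no autostop.
        if idle_minutes is None or idle_minutes < 0:
            policy = 'no autostop'
        else:
            action = 'autodown' if record['to_down'] else 'autostop'
            policy = f'{action} after {idle_minutes} min idle'
        lines.append(f"{record['name']}: {status_name}, {policy}")
    return lines
```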

`sky.autostop`#

sky.autostop(cluster_name, idle_minutes, wait_for=None, down=False)[source]

Schedules an autostop/autodown for a cluster.

Autostop/autodown will automatically stop or tear down a cluster when it becomes idle for a specified duration. Idleness means there are no in-progress (pending/running) jobs in a cluster’s job queue.

Idleness time of a cluster is reset to zero, whenever:

A job is submitted (sky.launch() or sky.exec()).
The cluster has restarted.
An autostop is set when there is no active setting. (Namely, either there’s never any autostop setting set, or the previous autostop setting was canceled.) This is useful for restarting the autostop timer.

Example: say a cluster without any autostop set has been idle for 1 hour, then an autostop of 30 minutes is set. The cluster will not be immediately autostopped. Instead, the idleness timer only starts counting once the autostop setting is set.

When multiple autostop settings are specified for the same cluster, the last setting takes precedence.
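The timer arithmetic in the example above can be written out as a toy model. This is only a sketch of the described behavior, not SkyPilot's implementation: earliest_autostop_minute is a hypothetical function, all times are minutes on a shared clock, it covers only the case where no autostop setting was active before, and it assumes the cluster stays idle after the timer starts.

```python
from typing import Optional


def earliest_autostop_minute(idle_since: float, autostop_set_at: float,
                             idle_minutes: int) -> Optional[float]:
    """Return the minute at which autostop would fire, or None if canceled."""
    if idle_minutes < 0:
        # A negative value cancels any autostop/autodown setting.
        return None
    # Setting autostop with no active setting resets the idleness timer,
    # so the timer starts at whichever happened later.
    timer_start = max(idle_since, autostop_set_at)
    return timer_start + idle_minutes
```

For the example above, earliest_autostop_minute(idle_since=0, autostop_set_at=60, idle_minutes=30) gives 90, i.e., 30 minutes after the setting was applied rather than immediately.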

Parameters:

cluster_name (str) – name of the cluster.
idle_minutes (int) – the number of minutes of idleness (no pending/running jobs) after which the cluster will be stopped automatically. Setting to a negative number cancels any autostop/autodown setting.
wait_for (Optional[AutostopWaitFor]) –
determines the condition for resetting the idleness timer. This option works in conjunction with idle_minutes. Choices:
1. "jobs_and_ssh" (default) - Wait for in-progress jobs and SSH connections to finish.
2. "jobs" - Only wait for in-progress jobs.
3. "none" - Wait for nothing; autostop right after idle_minutes.
down (bool) – if true, use autodown (tear down the cluster; non-restartable), rather than autostop (restartable).

Return type:

RequestId[None]

Returns:

The request ID of the autostop request.

Request Returns:

None

Request Raises:

sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend or the cluster is a TPU VM Pod.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.

Jobs SDK#

Cluster jobs SDK#

`sky.exec`#

sky.exec(task, cluster_name=None, dryrun=False, down=False, backend=None)[source]

Executes a task on an existing cluster.

This function performs two actions:

workdir syncing, if the task has a workdir specified;
executing the task’s run commands.

All other steps (provisioning, setup commands, file mounts syncing) are skipped. If any of those specifications changed in the task, this function will not reflect those changes. To ensure a cluster’s setup is up to date, use sky.launch() instead.

Execution and scheduling behavior:

The task will undergo job queue scheduling, respecting any specified resource requirement. It can be executed on any node of the cluster with enough resources.
The task is run under the workdir (if specified).
The task is run non-interactively (without a pseudo-terminal or pty), so interactive commands such as htop do not work. Use ssh my_cluster instead.

Parameters:

task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) containing the task to execute.
cluster_name (Optional[str]) – name of an existing cluster to execute the task.
dryrun (bool) – if True, do not actually execute the task.
down (bool) – Tear down the cluster after all jobs finish (successfully or abnormally). If --idle-minutes-to-autostop is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down, so that it can be debugged.
backend (Optional[Backend]) – backend to use. If None, use the default backend (CloudVmRayBackend).

Return type:

RequestId[Tuple[Optional[int], Optional[ResourceHandle]]]

Returns:

The request ID of the exec request.

Request Returns:

job_id (Optional[int]) – the job ID of the submitted job. None if the backend is not CloudVmRayBackend, or no job is submitted to the cluster.
handle (Optional[backends.ResourceHandle]) – the handle to the cluster. None if dryrun.

Request Raises:

ValueError – if the specified cluster is not in UP status.
sky.exceptions.ClusterDoesNotExist – if the specified cluster does not exist.
sky.exceptions.NotSupportedError – if the specified cluster is a controller that does not support this operation.

`sky.queue`#

sky.queue(cluster_name, skip_finished=False, all_users=False)[source]

Gets the job queue of a cluster.

Parameters:

cluster_name (str) – name of the cluster.
skip_finished (bool) – if True, skip finished jobs.
all_users (bool) – if True, return jobs from all users.

Return type:

RequestId[List[dict]]

Returns:

The request ID of the queue request.

Request Returns:

job_records (List[Dict[str, Any]]) – A list of dicts for each job in the queue.

[
    {
        'job_id': (int) job id,
        'job_name': (str) job name,
        'username': (str) username,
        'user_hash': (str) user hash,
        'submitted_at': (int) timestamp of submission,
        'start_at': (int) timestamp of start,
        'end_at': (int) timestamp of end,
        'resources': (str) resources,
        'status': (job_lib.JobStatus) job status,
        'log_path': (str) log path,
    }
]

Request Raises:

sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
sky.exceptions.CommandError – if failed to get the job queue with ssh.

`sky.job_status`#

sky.job_status(cluster_name, job_ids=None)[source]

Gets the status of jobs on a cluster.

Parameters:

cluster_name (str) – name of the cluster.
job_ids (Optional[List[int]]) – job ids. If None, get the status of the last job.

Return type:

RequestId[Dict[Optional[int], Optional[JobStatus]]]

Returns:

The request ID of the job status request.

Request Returns:

job_statuses (Dict[Optional[int], Optional[job_lib.JobStatus]]) – A mapping of job_id to job statuses. The status will be None if the job does not exist. If job_ids is None and there is no job on the cluster, it will return {None: None}.

Request Raises:

sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.
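The special cases of the returned mapping (a None status for a missing job, and {None: None} for an empty cluster) are easy to overlook. A minimal sketch of handling them follows; describe_job_statuses is a hypothetical helper, and it assumes status values expose a name attribute as standard Enum members do.

```python
from typing import Any, Dict, Optional


def describe_job_statuses(
        job_statuses: Dict[Optional[int], Any]) -> Dict[Optional[int], str]:
    """Turn the job_id -> status mapping into readable descriptions."""
    described: Dict[Optional[int], str] = {}
    for job_id, status in job_statuses.items():
        if job_id is None:
            # {None: None}: job_ids was None and the cluster has no jobs.
            described[job_id] = 'no jobs on the cluster'
        elif status is None:
            # The requested job ID does not exist.
            described[job_id] = 'job does not exist'
        else:
            described[job_id] = getattr(status, 'name', str(status))
    return described
```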

`sky.tail_logs`#

sky.tail_logs(cluster_name, job_id, follow, tail=0, output_stream=None)[source]

Tails the logs of a job.

Parameters:

cluster_name (str) – name of the cluster.
job_id (Optional[int]) – job id.
follow (bool) – if True, follow the logs. Otherwise, return the logs immediately.
tail (int) – if > 0, tail the last N lines of the logs.
output_stream (Optional[TextIOBase]) – the stream to write the logs to. If None, print to the console.

Return type:

int

Returns:

Exit code based on success or failure of the job. 0 if success, 100 if the job failed. See exceptions.JobExitCode for possible exit codes.

Request Raises:

ValueError – if arguments are invalid or the cluster is not supported.
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.

`sky.download_logs`#

sky.download_logs(cluster_name, job_ids)[source]

Downloads the logs of jobs.

Parameters:

cluster_name (str) – name of the cluster.
job_ids (Optional[List[str]]) – job ids.

Return type:

Dict[str, str]

Returns:

The request ID of the download_logs request.

Request Returns:

job_log_paths (Dict[str, str]) – a mapping of job_id to local log path.

Request Raises:

sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the cluster is not based on CloudVmRayBackend.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.

`sky.cancel`#

sky.cancel(cluster_name, all=False, all_users=False, job_ids=None, _try_cancel_if_cluster_is_init=False)[source]

Cancels jobs on a cluster.

Parameters:

cluster_name (str) – name of the cluster.
all (bool) – if True, cancel all jobs.
all_users (bool) – if True, cancel all jobs from all users.
job_ids (Optional[List[int]]) – a list of job IDs to cancel.
_try_cancel_if_cluster_is_init (bool) – whether to try cancelling the job even if the cluster is not UP, but the head node is still alive. This is used by the jobs controller to cancel the job when the worker node is preempted in the spot cluster.

Return type:

RequestId[None]

Returns:

The request ID of the cancel request.

Request Returns:

None

Request Raises:

ValueError – if arguments are invalid.
sky.exceptions.ClusterDoesNotExist – if the cluster does not exist.
sky.exceptions.ClusterNotUpError – if the cluster is not UP.
sky.exceptions.NotSupportedError – if the specified cluster is a controller that does not support this operation.
sky.exceptions.ClusterOwnerIdentityMismatchError – if the current user is not the same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError – if we fail to get the current user identity.

Managed jobs SDK#

`sky.jobs.launch`#

sky.jobs.launch(task, name=None, pool=None, num_jobs=None, _need_confirmation=False)[source]

Launches a managed job.

Please refer to sky.cli.job_launch for documentation.

Parameters:

task (Union[Task, Dag]) – sky.Task, or sky.Dag (experimental; 1-task only) to launch as a managed job.
name (Optional[str]) – Name of the managed job.
_need_confirmation (bool) – (Internal only) Whether to show a confirmation prompt before launching the job.

Return type:

RequestId[Tuple[Optional[int], Optional[ResourceHandle]]]

Returns:

The request ID of the launch request.

Request Returns:

job_id (Optional[int]) – Job ID for the managed job
controller_handle (Optional[ResourceHandle]) – ResourceHandle of the controller

Request Raises:

ValueError – if the cluster does not exist, or the entrypoint is not a valid chain DAG.
sky.exceptions.NotSupportedError – the feature is not supported.

`sky.jobs.queue`#

sky.jobs.queue(refresh, skip_finished=False, all_users=False, job_ids=None)[source]

Gets statuses of managed jobs.

Please refer to sky.cli.job_queue for documentation.

Parameters:

refresh (bool) – Whether to restart the jobs controller if it is stopped.
skip_finished (bool) – Whether to skip finished jobs.
all_users (bool) – Whether to show all users’ jobs.
job_ids (Optional[List[int]]) – IDs of the managed jobs to show.

Return type:

RequestId[List[Dict[str, Any]]]

Returns:

The request ID of the queue request.

Request Returns:

job_records (List[Dict[str, Any]]) – A list of dicts, with each dict containing the information of a job.

[
  {
    'job_id': (int) job id,
    'job_name': (str) job name,
    'resources': (str) resources of the job,
    'submitted_at': (float) timestamp of submission,
    'end_at': (float) timestamp of end,
    'job_duration': (float) duration in seconds,
    'recovery_count': (int) Number of retries,
    'status': (sky.jobs.ManagedJobStatus) of the job,
    'cluster_resources': (str) resources of the cluster,
    'region': (str) region of the cluster,
    'task_id': (int) set to 0 (except in pipelines, which may have multiple tasks),
    'task_name': (str) same as job_name (except in pipelines, which may have multiple tasks),
  }
]

Request Raises:

sky.exceptions.ClusterNotUpError – the jobs controller is not up or does not exist.
RuntimeError – if failed to get the managed jobs with ssh.
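Since a pipeline can produce several records with the same job_id and different task_id values, it can help to group the records per job. A minimal sketch, where group_tasks_by_job is a hypothetical helper that uses only the job_id and task_id fields listed above:

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_tasks_by_job(
        job_records: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Group managed-job records by job_id, ordering tasks by task_id."""
    grouped: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for record in job_records:
        grouped[record['job_id']].append(record)
    # Sort each job's tasks so pipeline stages appear in order.
    return {
        job_id: sorted(tasks, key=lambda r: r['task_id'])
        for job_id, tasks in grouped.items()
    }
```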

`sky.jobs.cancel`#

sky.jobs.cancel(name=None, job_ids=None, all=False, all_users=False, pool=None)[source]

Cancels managed jobs.

Please refer to sky.cli.job_cancel for documentation.

Parameters:

name (Optional[str]) – Name of the managed job to cancel.
job_ids (Optional[Sequence[int]]) – IDs of the managed jobs to cancel.
all (bool) – Whether to cancel all managed jobs.
all_users (bool) – Whether to cancel all managed jobs from all users.
pool (Optional[str]) – Pool name to cancel.

Return type:

RequestId[None]

Returns:

The request ID of the cancel request.

Request Raises:

sky.exceptions.ClusterNotUpError – the jobs controller is not up.
RuntimeError – failed to cancel the job.

`sky.jobs.tail_logs`#

sky.jobs.tail_logs(name=None, job_id=None, follow=True, controller=False, refresh=False, tail=None, output_stream=None)[source]

Tails logs of managed jobs.

You can provide either a job name or a job ID to tail logs. If neither is provided, the logs of the latest job will be shown.

Parameters:

name (Optional[str]) – Name of the managed job to tail logs.
job_id (Optional[int]) – ID of the managed job to tail logs.
follow (bool) – Whether to follow the logs.
controller (bool) – Whether to tail logs from the jobs controller.
refresh (bool) – Whether to restart the jobs controller if it is stopped.
tail (Optional[int]) – Number of lines to tail from the end of the log file.
output_stream (Optional[TextIOBase]) – The stream to write the logs to. If None, print to the console.

Return type:

int

Returns:

Exit code based on success or failure of the job. 0 if success, 100 if the job failed. See exceptions.JobExitCode for possible exit codes.

Request Raises:

ValueError – invalid arguments.
sky.exceptions.ClusterNotUpError – the jobs controller is not up.

Volumes SDK#

`sky.volumes.ls`#

sky.volumes.ls()[source]

Lists all volumes.

Return type:

RequestId[List[Dict[str, Any]]]

Returns:

The request ID of the list request.

`sky.volumes.apply`#

sky.volumes.apply(volume)[source]

Creates or registers a volume.

Parameters:

volume (Volume) – The volume to apply.

Return type:

RequestId[None]

Returns:

The request ID of the apply request.

`sky.volumes.delete`#

sky.volumes.delete(names)[source]

Deletes volumes.

Parameters:

names (List[str]) – List of volume names to delete.

Return type:

RequestId[None]

Returns:

The request ID of the delete request.

Serving SDK#

`sky.serve.up`#

sky.serve.up(task, service_name, _need_confirmation=False)[source]

Spins up a service.

Please refer to sky.cli.serve_up for documentation.

Parameters:

task (Union[Task, Dag]) – sky.Task to serve up.
service_name (str) – Name of the service.
_need_confirmation (bool) – (Internal only) Whether to show a confirmation prompt before spinning up the service.

Return type:

RequestId[Tuple[str, str]]

Returns:

The request ID of the up request.

Request Returns:

service_name (str) – The name of the service. Same as the service_name passed in as an argument.
endpoint (str) – The service endpoint.

`sky.serve.update`#

sky.serve.update(task, service_name, mode, _need_confirmation=False)[source]

Updates an existing service.

Please refer to sky.cli.serve_update for documentation.

Parameters:

task (Union[Task, Dag]) – sky.Task to update.
service_name (str) – Name of the service.
mode (UpdateMode) – Update mode. Choices: sky.serve.UpdateMode.ROLLING, sky.serve.UpdateMode.BLUE_GREEN.
_need_confirmation (bool) – (Internal only) Whether to show a confirmation prompt before updating the service.

Return type:

RequestId[None]

Returns:

The request ID of the update request.

Request Returns:

None

`sky.serve.down`#

sky.serve.down(service_names, all=False, purge=False)[source]

Tears down a service.

Please refer to sky.cli.serve_down for documentation.

Parameters:

service_names (Union[str, List[str], None]) – Name of the service(s).
all (bool) – Whether to terminate all services.
purge (bool) – Whether to terminate services in a failed status. These services may potentially lead to resource leaks.

Return type:

RequestId[None]

Returns:

The request ID of the down request.

Request Returns:

None

Request Raises:

sky.exceptions.ClusterNotUpError – if the sky serve controller is not up.
ValueError – if the arguments are invalid.
RuntimeError – if failed to terminate the service.

`sky.serve.terminate_replica`#

sky.serve.terminate_replica(service_name, replica_id, purge)[source]

Tears down a specific replica for the given service.

Parameters:

service_name (str) – Name of the service.
replica_id (int) – ID of replica to terminate.
purge (bool) – Whether to terminate replicas in a failed status. These replicas may lead to resource leaks, so we require the user to explicitly specify this flag to make sure they are aware of this potential resource leak.

Return type:

RequestId[None]

Returns:

The request ID of the terminate replica request.

Request Raises:

sky.exceptions.ClusterNotUpError – if the sky serve controller is not up.
RuntimeError – if failed to terminate the replica.

`sky.serve.status`#

sky.serve.status(service_names)[source]

Gets service statuses.

If service_names is given, return those services. Otherwise, return all services.

Each returned value has the following fields:

{
    'name': (str) service name,
    'active_versions': (List[int]) a list of versions that are active,
    'controller_job_id': (int) the job id of the controller,
    'uptime': (int) uptime in seconds,
    'status': (sky.ServiceStatus) service status,
    'controller_port': (Optional[int]) controller port,
    'load_balancer_port': (Optional[int]) load balancer port,
    'endpoint': (Optional[str]) endpoint of the service,
    'policy': (Optional[str]) autoscaling policy description,
    'requested_resources_str': (str) str representation of
      requested resources,
    'load_balancing_policy': (str) load balancing policy name,
    'replica_info': (List[Dict[str, Any]]) replica information,
}

Each entry in replica_info has the following fields:

{
    'replica_id': (int) replica id,
    'name': (str) replica name,
    'status': (sky.serve.ReplicaStatus) replica status,
    'version': (int) replica version,
    'launched_at': (int) launch timestamp,
    'handle': (ResourceHandle) handle of the replica cluster,
    'endpoint': (str) endpoint of the replica,
}

For possible service statuses and replica statuses, please refer to sky.cli.serve_status.
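As a minimal sketch of how the records described above might be consumed, the helper below counts replicas by status for each service. The sample record is hand-written for illustration, covers only a subset of the documented fields, and uses placeholder strings for the status values (the real values are enum members). `summarize_services` is a hypothetical name, not part of the SDK. In practice, the list would be obtained by passing the request ID returned by sky.serve.status() to sky.get().

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_services(
        records: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Map each service name to a count of its replicas by status."""
    summary = {}
    for record in records:
        counts = Counter(str(r['status']) for r in record['replica_info'])
        summary[record['name']] = dict(counts)
    return summary


# Hand-written sample shaped like the documented fields (subset only).
# The status strings are placeholders, not real enum members.
sample_records = [{
    'name': 'my-service',
    'active_versions': [1],
    'uptime': 120,
    'status': 'READY',
    'replica_info': [
        {'replica_id': 1, 'name': 'my-service-1', 'status': 'READY'},
        {'replica_id': 2, 'name': 'my-service-2', 'status': 'STARTING'},
        {'replica_id': 3, 'name': 'my-service-3', 'status': 'READY'},
    ],
}]

print(summarize_services(sample_records))
```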

Parameters:

service_names (Union[str, List[str], None]) – a single or a list of service names to query. If None, query all services.

Return type:

RequestId[List[Dict[str, Any]]]

Returns:

The request ID of the status request.

Request Returns:

service_records (List[Dict[str, Any]]) – A list of dicts, with each dict containing the information of a service. If a service is not found, it will be omitted from the returned list.

Request Raises:

RuntimeError – if failed to get the service status.
exceptions.ClusterNotUpError – if the sky serve controller is not up.

`sky.serve.tail_logs`#

sky.serve.tail_logs(service_name, target, replica_id=None, follow=True, output_stream=None, tail=None)[source]

Tails logs for a service.

Usage:

sky.serve.tail_logs(
    service_name,
    target=<component>,
    follow=False,  # Optional; defaults to True
    # replica_id=3, # Must be specified when target is REPLICA.
)

target is a sky.serve.ServiceComponent enum value, which can be one of:

sky.serve.ServiceComponent.CONTROLLER
sky.serve.ServiceComponent.LOAD_BALANCER
sky.serve.ServiceComponent.REPLICA

Passing target as a lower-case string is also supported, e.g. target='controller'. To use sky.serve.ServiceComponent.REPLICA, you must specify replica_id.

To tail controller logs:

# follow defaults to True
sky.serve.tail_logs(
    service_name, target=sky.serve.ServiceComponent.CONTROLLER
)

To print replica 3 logs:

# Passing target as a lower-case string is also supported.
sky.serve.tail_logs(
    service_name, target='replica',
    follow=False, replica_id=3
)
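The argument rules above (enum or lower-case string for target, and replica_id required for replicas) can be sketched as a small check. This is an illustration only, not SkyPilot's implementation: `Component` is a stand-in for sky.serve.ServiceComponent with assumed string values, and `resolve_target` is a hypothetical helper.

```python
import enum
from typing import Optional, Union


class Component(enum.Enum):
    """Stand-in for sky.serve.ServiceComponent (values are assumed)."""
    CONTROLLER = 'controller'
    LOAD_BALANCER = 'load_balancer'
    REPLICA = 'replica'


def resolve_target(target: Union[str, Component],
                   replica_id: Optional[int] = None) -> Component:
    """Normalize target to an enum member and check replica_id."""
    # Component('bogus') raises ValueError for unknown strings.
    component = Component(target) if isinstance(target, str) else target
    if component is Component.REPLICA and replica_id is None:
        raise ValueError(
            'replica_id must be specified when target is REPLICA.')
    return component


print(resolve_target('controller'))
print(resolve_target(Component.REPLICA, replica_id=3))
```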

Parameters:

service_name (str) – Name of the service.
target (Union[str, ServiceComponent]) – The component whose logs to tail.
replica_id (Optional[int]) – The ID of the replica whose logs to tail. Must be specified when target is REPLICA.
follow (bool) – Whether to follow the logs.
output_stream (Optional[TextIOBase]) – The stream to write the logs to. If None, print to the console.

Return type:

None

Returns:

The request ID of the tail logs request.

Request Raises:

sky.exceptions.ClusterNotUpError – the sky serve controller is not up.
ValueError – arguments not valid, or failed to tail the logs.

Task#

class sky.Task(name=None, *, setup=None, run=None, envs=None, secrets=None, workdir=None, num_nodes=None, file_mounts=None, storage_mounts=None, volumes=None, resources=None, docker_image=None, event_callback=None, blocked_resources=None, _file_mounts_mapping=None, _volume_mounts=None, _metadata=None, _user_specified_yaml=None)[source]#

Task: a computation to be run on the cloud.

__init__(name=None, *, setup=None, run=None, envs=None, secrets=None, workdir=None, num_nodes=None, file_mounts=None, storage_mounts=None, volumes=None, resources=None, docker_image=None, event_callback=None, blocked_resources=None, _file_mounts_mapping=None, _volume_mounts=None, _metadata=None, _user_specified_yaml=None)[source]#

Initializes a Task.

All fields are optional. Task.run is the actual program: either a shell command to run (str) or a command generator for different nodes (lambda; see below).

Optionally, call Task.set_resources() to set the resource requirements for this task. If not set, a default CPU-only requirement is assumed (the same as sky launch).

All setters of this class, Task.set_*(), return self, i.e., they are fluent APIs and can be chained together.

Example

# A Task that will sync up local workdir '.', containing
# requirements.txt and train.py.
sky.Task(setup='pip install -r requirements.txt',
         run='python train.py',
         workdir='.')

# An empty Task for provisioning a cluster.
task = sky.Task(num_nodes=n).set_resources(...)

# Chaining setters.
sky.Task().set_resources(...).set_file_mounts(...)

Parameters:

name (Optional[str]) – A string name for the Task for display purposes.
setup (Union[str, List[str], None]) – One or more setup commands, which are run before the run commands and executed under workdir.
run (Union[str, Callable[[int, List[str]], Optional[str]], List[str], None]) – The actual command for the task. If not None, either one or more shell commands (str, list(str)) or a command generator (callable). In the latter case, it must take a node rank and a list of node addresses as input and return a shell command (str); returning None for some nodes is valid, in which case no commands are run on them. Run commands are run under workdir. Note that the command generator should be a self-contained lambda.
envs (Optional[Dict[str, str]]) – A dictionary of environment variables to set before running the setup and run commands.
secrets (Optional[Dict[str, str]]) – A dictionary of secret environment variables to set before running the setup and run commands. These will be redacted in logs and YAML output.
workdir (Union[str, Dict[str, Any], None]) – The local working directory or a git repository. For a local working directory, this directory will be synced to a location on the remote VM(s), and setup and run commands will be run under that location (thus, they can rely on relative paths when invoking binaries). If a git repository is provided, the repository will be cloned to the working directory and the setup and run commands will be run under the cloned repository.
num_nodes (Optional[int]) – The number of nodes to provision for this Task. If None, treated as 1 node. If > 1, each node will execute its own setup/run command, where run can either be a str, meaning all nodes get the same command, or a lambda, with the semantics documented above.
file_mounts (Optional[Dict[str, str]]) – An optional dict of {remote_path: (local_path|cloud URI)}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
storage_mounts (Optional[Dict[str, Storage]]) – An optional dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
volumes (Optional[Dict[str, str]]) – A dict of volumes to be mounted for the task. The dict has the form of {mount_path: volume_name}.
resources (Union[Resources, List[Resources], Set[Resources], None]) – either a sky.Resources, a set of them, or a list of them. A set or a list of resources asks the optimizer to “pick the best of these resources” to run this task.
docker_image (Optional[str]) – (EXPERIMENTAL: Only in effect when LocalDockerBackend is used.) The base docker image that this Task will be built on. Defaults to ‘gpuci/miniforge-cuda:11.4-devel-ubuntu18.04’.
event_callback (Optional[str]) – A bash script that will be executed when the task changes state.
blocked_resources (Optional[Iterable[Resources]]) – A set of resources that this task cannot run on.
_file_mounts_mapping (Optional[Dict[str, str]]) – (Internal use only) A dictionary of file mounts mapping.
_volume_mounts (Optional[List[VolumeMount]]) – (Internal use only) A list of volume mounts.
_metadata (Optional[Dict[str, Any]]) – (Internal use only) A dictionary of metadata to be added to the task.
_user_specified_yaml (Optional[str]) – (Internal use only) A string of user-specified YAML config.
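The command-generator form of run described above can be sketched as follows. The function takes a node rank and the list of node addresses and returns a shell command, or None to run nothing on that node. The script name and its flags are hypothetical, and the function is written as a plain def for readability; it uses nothing from outside its own body, in the spirit of the "self-contained" requirement.

```python
from typing import List, Optional


def run_per_node(node_rank: int, node_ips: List[str]) -> Optional[str]:
    """Return the shell command for one node, or None to run nothing."""
    head_ip = node_ips[0]
    if node_rank == 0:
        # The first node acts as the head in this sketch.
        return f'python train.py --role head --bind {head_ip}'
    if node_rank == len(node_ips) - 1:
        # Keep the last node idle, to show that None is a valid return.
        return None
    return f'python train.py --role worker --head {head_ip}'


ips = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
for rank in range(len(ips)):
    print(rank, run_per_node(rank, ips))
```

Such a callable would be passed as the run argument together with num_nodes, for example sky.Task(run=run_per_node, num_nodes=3).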

static from_yaml(yaml_path)[source]#

Initializes a task from a task YAML.

Example

task = sky.Task.from_yaml('/path/to/task.yaml')

Parameters:: yaml_path (str) – file path to a valid task yaml file.
Raises:: ValueError – if the path gets loaded into a str instead of a dict; or if there are any other parsing errors.
Return type:: Task

resolve_and_validate_volumes()[source]#

Resolve volumes config to volume mounts and validate them.

Raises:

exceptions.VolumeNotFoundError – if any volume is not found.
exceptions.VolumeTopologyConflictError – if there is a conflict between the volumes and the compute topology.

Return type:

None

set_volumes(volumes)[source]#

Sets the volumes for this task.

Parameters:: volumes (Dict[str, str]) – a dict of {mount_path: volume_name}.
Return type:: None

update_volumes(volumes)[source]#

Updates the volumes for this task.

Return type:: None

update_envs(envs)[source]#

Updates environment variables for use inside the setup/run commands.

Parameters:: envs (Union[None, List[Tuple[str, str]], Dict[str, str]]) – (optional) either a list of (env_name, value) or a dict {env_name: value}.
Return type:: Task
Returns:: self – The current task, with envs updated.
Raises:: ValueError – if invalid inputs are detected.
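The two accepted input shapes, a list of (env_name, value) pairs or a dict, can be normalized into a single dict. The sketch below is an illustration under stated assumptions, not the SDK's own validation: `normalize_envs` is a hypothetical helper, and the specific checks it performs are assumed.

```python
from typing import Dict, List, Tuple, Union

EnvsLike = Union[None, List[Tuple[str, str]], Dict[str, str]]


def normalize_envs(envs: EnvsLike) -> Dict[str, str]:
    """Turn None, a list of pairs, or a dict into a plain dict."""
    if envs is None:
        return {}
    pairs = list(envs.items()) if isinstance(envs, dict) else envs
    result: Dict[str, str] = {}
    for item in pairs:
        if not isinstance(item, (tuple, list)) or len(item) != 2:
            raise ValueError(f'Expected (name, value) pairs, got {item!r}')
        name, value = item
        if not isinstance(name, str) or not name:
            raise ValueError(f'Invalid environment variable name: {name!r}')
        result[name] = value
    return result


print(normalize_envs([('MODEL', 'resnet'), ('EPOCHS', '10')]))
print(normalize_envs({'MODEL': 'resnet'}))
```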

update_secrets(secrets)[source]#

Updates secret env vars for use inside the setup/run commands.

Parameters:: secrets (Union[None, List[Tuple[str, str]], Dict[str, str]]) – (optional) either a list of (secret_name, value) or a dict {secret_name: value}.
Return type:: Task
Returns:: self – The current task, with secrets updated.
Raises:: ValueError – if invalid inputs are detected.

set_resources(resources)[source]#

Sets the required resources to execute this task.

If this function is not called for a Task, default resource requirements will be used (8 vCPUs).

Parameters:: resources (Union[Resources, List[Resources], Set[Resources]]) – either a sky.Resources, a set of them, or a list of them. A set or a list of resources asks the optimizer to “pick the best of these resources” to run this task.
Return type:: Task
Returns:: self – The current task, with resources set.

set_resources_override(override_params)[source]#

Sets the override parameters for the resources.

Return type:: Task

set_service(service)[source]#

Sets the service spec for this task.

Parameters:: service (Optional[SkyServiceSpec]) – a SkyServiceSpec object.
Return type:: Task
Returns:: self – The current task, with service set.

set_file_mounts(file_mounts)[source]#

Sets the file mounts for this task.

Useful for syncing datasets, dotfiles, etc.

File mounts are a dictionary: {remote_path: local_path/cloud URI}. Local (or cloud) files/directories will be synced to the specified paths on the remote VM(s) where this Task will run.

Neither source nor destination paths can end with a slash.

Example

task.set_file_mounts({
    '~/.dotfile': '/local/.dotfile',
    # /remote/dir/ will contain the contents of /local/dir/.
    '/remote/dir': '/local/dir',
})

Parameters:: file_mounts (Optional[Dict[str, str]]) – an optional dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
Return type:: Task
Returns:: self – the current task, with file mounts set.
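The no-trailing-slash rule for file mounts can be checked before building a Task. The helper below is a hypothetical pre-flight check written for illustration; it is not the validation the SDK performs.

```python
from typing import Dict, List


def paths_with_trailing_slash(file_mounts: Dict[str, str]) -> List[str]:
    """Return every remote or source path that ends with a slash."""
    offending = []
    for remote_path, source in file_mounts.items():
        for path in (remote_path, source):
            if path.endswith('/'):
                offending.append(path)
    return offending


mounts = {
    '~/.dotfile': '/local/.dotfile',
    '/remote/dir/': '/local/dir',  # Remote path ends with a slash.
}
print(paths_with_trailing_slash(mounts))
```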

update_file_mounts(file_mounts)[source]#

Updates the file mounts for this task.

Unlike set_file_mounts(), this function merges the given mounts into the existing file_mounts (it calls dict.update()) rather than overwriting them.

This should be called before provisioning in order to take effect.

Example

task.update_file_mounts({
    '~/.config': '~/Documents/config',
    '/tmp/workdir': '/local/workdir/cnn-cifar10',
})

Parameters:: file_mounts (Dict[str, str]) – a dict of {remote_path: local_path/cloud URI}, where remote means the VM(s) on which this Task will eventually run, and local means the node from which the task is launched.
Return type:: Task
Returns:: self – the current task, with file mounts updated.
Raises:: ValueError – if input paths are invalid.
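The difference between setting and updating file mounts comes down to replacing a dict versus calling dict.update() on it. The sketch below uses plain dicts to show the two outcomes; the variable names are illustrative and no SDK call is involved.

```python
existing = {
    '~/.config': '~/Documents/config',
    '/tmp/workdir': '/local/old-workdir',
}
new = {'/tmp/workdir': '/local/workdir/cnn-cifar10'}

# "set" semantics: the new mapping replaces the old one entirely.
after_set = dict(new)

# "update" semantics: entries are merged; on a key collision the new
# value wins, and untouched keys are kept.
after_update = dict(existing)
after_update.update(new)

print(after_set)
print(after_update)
```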

set_storage_mounts(storage_mounts)[source]#

Sets the storage mounts for this task.

Storage mounts are a dictionary: {mount_path: sky.Storage object}, each of which mounts a sky.Storage object (a cloud object store bucket) to a path inside the remote cluster.

A sky.Storage object can be created by uploading from a local directory (setting source), or backed by an existing cloud bucket (setting name to the bucket name; or setting source to the bucket URI).

Example

task.set_storage_mounts({
    '/remote/imagenet/': sky.Storage(name='my-bucket',
                                     source='/local/imagenet'),
})

Parameters:: storage_mounts (Optional[Dict[str, Storage]]) – an optional dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
Return type:: Task
Returns:: self – The current task, with storage mounts set.
Raises:: ValueError – if input paths are invalid.

update_storage_mounts(storage_mounts)[source]#

Updates the storage mounts for this task.

Unlike set_storage_mounts(), this function merges the given mounts into the existing storage_mounts (it calls dict.update()) rather than overwriting them.

This should be called before provisioning in order to take effect.

Parameters:: storage_mounts (Dict[str, Storage]) – a dict of {mount_path: sky.Storage object}, where mount_path is the path inside the remote VM(s) where the Storage object will be mounted.
Return type:: Task
Returns:: self – The current task, with storage mounts updated.
Raises:: ValueError – if input paths are invalid.

Resources#

class sky.Resources(cloud=None, instance_type=None, cpus=None, memory=None, accelerators=None, accelerator_args=None, infra=None, use_spot=None, job_recovery=None, region=None, zone=None, image_id=None, disk_size=None, disk_tier=None, network_tier=None, ports=None, labels=None, autostop=None, priority=None, volumes=None, _docker_login_config=None, _docker_username_for_runpod=None, _is_image_managed=None, _requires_fuse=None, _cluster_config_overrides=None, _no_missing_accel_warnings=None)[source]#

Resources: compute requirements of Tasks.

This class is immutable once created (to ensure some validations are done whenever properties change). To update the property of an instance of Resources, use resources.copy(**new_properties).

Used:

for representing resource requests for tasks/apps
as a “filter” to get concrete launchable instances
for calculating billing
for provisioning on a cloud

__init__(cloud=None, instance_type=None, cpus=None, memory=None, accelerators=None, accelerator_args=None, infra=None, use_spot=None, job_recovery=None, region=None, zone=None, image_id=None, disk_size=None, disk_tier=None, network_tier=None, ports=None, labels=None, autostop=None, priority=None, volumes=None, _docker_login_config=None, _docker_username_for_runpod=None, _is_image_managed=None, _requires_fuse=None, _cluster_config_overrides=None, _no_missing_accel_warnings=None)[source]#

Initialize a Resources object.

All fields are optional. Resources.is_launchable decides whether the Resources is fully specified to launch an instance.

Examples

# Fully specified cloud and instance type (is_launchable() is True).
sky.Resources(infra='aws', instance_type='p3.2xlarge')
sky.Resources(infra='k8s/my-cluster-ctx', accelerators='V100')
sky.Resources(infra='gcp/us-central1', accelerators='V100')

# Specifying required resources; the system decides the
# cloud/instance type. The below are equivalent:
sky.Resources(accelerators='V100')
sky.Resources(accelerators='V100:1')
sky.Resources(accelerators={'V100': 1})
sky.Resources(cpus='2+', memory='16+', accelerators='V100')

Parameters:

cloud (Optional[Cloud]) – the cloud to use. Deprecated. Use infra instead.
instance_type (Optional[str]) – the instance type to use.
cpus (Union[None, int, float, str]) – the number of CPUs required for the task. If a str, must be a string of the form '2' or '2+', where the + indicates that the task requires at least 2 CPUs.
memory (Union[None, int, float, str]) – the amount of memory in GiB required. If a str, must be a string of the form '16' or '16+', where the + indicates that the task requires at least 16 GiB of memory.
accelerators (Union[None, str, Dict[str, Union[int, float]]]) – the accelerators required. If a str, must be a string of the form 'V100' or 'V100:2', where the :2 indicates that the task requires 2 V100 GPUs. If a dict, must be a dict of the form {'V100': 2} or {'tpu-v2-8': 1}.
accelerator_args (Optional[Dict[str, str]]) – accelerator-specific arguments. For example, {'tpu_vm': True, 'runtime_version': 'tpu-vm-base'} for TPUs.
infra (Optional[str]) – a string specifying the infrastructure to use, in the format of “cloud/region” or “cloud/region/zone”. For example, aws/us-east-1 or k8s/my-cluster-ctx. This is an alternative to specifying cloud, region, and zone separately. If provided, it takes precedence over cloud, region, and zone parameters.
use_spot (Optional[bool]) – whether to use spot instances. If None, defaults to False.
job_recovery (Union[Dict[str, Union[str, int, None]], str, None]) –
the job recovery strategy to use for the managed job to recover the cluster from preemption. Refer to the recovery_strategy module for more details. When a dict is provided, it can have the following fields:
- strategy: the recovery strategy to use.
- max_restarts_on_errors: the max number of restarts on user code errors.
region (Optional[str]) – the region to use. Deprecated. Use infra instead.
zone (Optional[str]) – the zone to use. Deprecated. Use infra instead.
image_id (Union[Dict[Optional[str], str], str, None]) –
the image ID to use. If a str, must be a string of the image ID from the cloud, such as AWS: 'ami-1234567890abcdef0', GCP: 'projects/my-project-id/global/images/my-image-name'; or an image tag provided by SkyPilot, such as AWS: 'skypilot:gpu-ubuntu-2004'. If a dict, must be a dict mapping from region to image ID, such as:
```
{
  'us-west1': 'ami-1234567890abcdef0',
  'us-east1': 'ami-1234567890abcdef0'
}
```
disk_size (Union[str, int, None]) – the size of the OS disk in GiB.
disk_tier (Union[str, DiskTier, None]) – the disk performance tier to use. If None, defaults to 'medium'.
network_tier (Union[str, NetworkTier, None]) – the network performance tier to use. If None, defaults to 'standard'.
ports (Union[int, str, List[str], Tuple[str], None]) – the ports to open on the instance.
labels (Optional[Dict[str, str]]) – the labels to apply to the instance. These are useful for assigning metadata that may be used by external tools. The implementation depends on the chosen cloud: on AWS, labels map to instance tags; on GCP, labels map to instance labels; on Kubernetes, labels map to pod labels. On other clouds, labels are not supported and will be ignored.
autostop (Union[bool, int, str, Dict[str, Any], None]) – the autostop configuration to use. For launched resources, may or may not correspond to the actual current autostop config.
priority (Optional[int]) – the priority for this resource configuration. Must be an integer from -1000 to 1000, where higher values indicate higher priority. If None, no priority is set.
volumes (Optional[List[Dict[str, Any]]]) – the volumes to mount on the instance.
_docker_login_config (Optional[DockerLoginConfig]) – the docker configuration to use. This includes the docker username, password, and registry server. If None, skip docker login.
_docker_username_for_runpod (Optional[str]) – the login username for the docker containers. This is used by RunPod to set the ssh user for the docker containers.
_requires_fuse (Optional[bool]) – whether the task requires FUSE mounting support. This is used internally by certain cloud implementations to do additional setup for FUSE mounting. This flag also safeguards against using FUSE mounting on existing clusters that do not support it. If None, defaults to False.

Raises:

ValueError – if some attributes are invalid.
exceptions.NoCloudAccessError – if no public cloud is enabled.
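The string shorthands described in the parameter list ('V100' or 'V100:2' for accelerators, '2' or '2+' for cpus and memory) can be illustrated with two small parsers. These are hypothetical helpers written for this sketch and are not SkyPilot's own parsing code; error handling for malformed input is left out.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]


def normalize_accelerators(
        spec: Union[str, Dict[str, Number]]) -> Dict[str, Number]:
    """Convert 'V100', 'V100:2' or {'V100': 2} into dict form."""
    if isinstance(spec, dict):
        return dict(spec)
    name, sep, count = spec.partition(':')
    if not sep:
        # No explicit count means one accelerator.
        return {name: 1}
    number = float(count)
    return {name: int(number) if number.is_integer() else number}


def parse_at_least(spec: Union[int, float, str]) -> Tuple[float, bool]:
    """Return (amount, at_least) for values such as 2, '2' or '2+'."""
    text = str(spec)
    at_least = text.endswith('+')
    amount = float(text[:-1] if at_least else text)
    return amount, at_least


print(normalize_accelerators('V100'))
print(normalize_accelerators('V100:4'))
print(parse_at_least('16+'))
```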

copy(**override)[source]#

Returns a copy of the given Resources.

Return type:: Resources
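The immutable-plus-copy pattern that Resources follows can be sketched with a frozen dataclass. `ResourcesSketch` is a stand-in with only three of the documented fields, written for illustration; it is not how sky.Resources is implemented.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class ResourcesSketch:
    """Stand-in for an immutable resource spec (illustrative only)."""
    infra: Optional[str] = None
    accelerators: Optional[str] = None
    use_spot: bool = False

    def copy(self, **override) -> 'ResourcesSketch':
        # Build a new instance; the original is left untouched.
        return dataclasses.replace(self, **override)


base = ResourcesSketch(infra='aws', accelerators='V100:4')
spot = base.copy(use_spot=True)
print(base)
print(spot)
```

With the real class, the equivalent call would be a copy with keyword overrides, such as resources.copy(use_spot=True), matching the copy(**override) signature above.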

Enums#

class sky.ClusterStatus(value)[source]#

Cluster status as recorded in local cache.

This can be different from the actual cluster status, and can be refreshed by running sky status --refresh.

INIT = 'INIT'#

Initializing.

This means provisioning has started but has not successfully finished. The cluster may be undergoing setup, may have failed setup, or may be live or down.

UP = 'UP'#: The cluster is up. This means provisioning has previously succeeded.

STOPPED = 'STOPPED'#: The cluster is stopped.

class sky.JobStatus(value)[source]#

Job status enum.

INIT = 'INIT'#: The job has been submitted, but not started yet.

PENDING = 'PENDING'#: The job is waiting for required resources.

SETTING_UP = 'SETTING_UP'#: The job is running the user’s setup script.

RUNNING = 'RUNNING'#: The job is running.

FAILED_DRIVER = 'FAILED_DRIVER'#: The job driver process failed.

SUCCEEDED = 'SUCCEEDED'#: The job finished successfully.

FAILED = 'FAILED'#: The job failed due to user code.

FAILED_SETUP = 'FAILED_SETUP'#: The job setup failed.

CANCELLED = 'CANCELLED'#: The job is cancelled by the user.
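When polling a job, it is often useful to know whether its status can still change. The sketch below mirrors the values listed above in a stand-in enum and groups them into finished and unfinished states. The grouping is an assumption inferred from the descriptions, and `JobStatusSketch` and `is_finished` are hypothetical names rather than SDK members.

```python
import enum


class JobStatusSketch(enum.Enum):
    """Stand-in mirroring the documented sky.JobStatus values."""
    INIT = 'INIT'
    PENDING = 'PENDING'
    SETTING_UP = 'SETTING_UP'
    RUNNING = 'RUNNING'
    FAILED_DRIVER = 'FAILED_DRIVER'
    SUCCEEDED = 'SUCCEEDED'
    FAILED = 'FAILED'
    FAILED_SETUP = 'FAILED_SETUP'
    CANCELLED = 'CANCELLED'


# Assumed: these statuses still lead to further state changes.
UNFINISHED = {
    JobStatusSketch.INIT,
    JobStatusSketch.PENDING,
    JobStatusSketch.SETTING_UP,
    JobStatusSketch.RUNNING,
}


def is_finished(status: JobStatusSketch) -> bool:
    """Return True if the job is not expected to change state again."""
    return status not in UNFINISHED


for status in JobStatusSketch:
    print(status.value, is_finished(status))
```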

class sky.StatusRefreshMode(value)[source]#

The mode of refreshing the status of a cluster.

NONE = 'NONE'#: Do not refresh any clusters.

AUTO = 'AUTO'#: Only refresh clusters that have autostop set or use spot instances.

FORCE = 'FORCE'#: Enforce refreshing all clusters.

API server SDK#

`sky.get`#

sky.get(request_id)[source]

Waits for and gets the result of a request.

This function does not check the server health, since /api/get is typically not the first API call in an SDK session, and checking the server health may cause GET /api/get to be sent to a restarted API server.

Parameters:: request_id (RequestId[TypeVar(T)]) – The request ID of the request to get. May be a full request ID or a prefix.
Return type:: TypeVar(T)
Returns:: The Request Returns of the specified request. See the documentation of the specific requests above for more details.
Raises:: Exception – It raises the same exceptions as the specific requests, see Request Raises in the documentation of the specific requests above.
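The submit-then-get pattern is loosely analogous to futures in Python's standard library: submitting returns a handle immediately, and asking for the result either returns the value or re-raises the exception from the work itself. The sketch below uses concurrent.futures purely as an analogy; it is not how the SDK or the API server is implemented, and `launch_stub` is a hypothetical placeholder for a long-running request.

```python
from concurrent.futures import ThreadPoolExecutor


def launch_stub(name: str) -> str:
    """Placeholder for a long-running request."""
    if not name:
        raise ValueError('cluster name must not be empty')
    return f'{name} is up'


with ThreadPoolExecutor(max_workers=2) as pool:
    # Submitting returns immediately with a handle, like a request ID.
    ok = pool.submit(launch_stub, 'my-cluster')
    bad = pool.submit(launch_stub, '')

    # Fetching the result blocks until the work has finished.
    result = ok.result()

    # A failure inside the work is raised when the result is fetched.
    try:
        bad.result()
        error = None
    except ValueError as e:
        error = str(e)

print(result)
print(error)
```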

`sky.stream_and_get`#

sky.stream_and_get(request_id=None, log_path=None, tail=None, follow=True, output_stream=None)[source]

Streams the logs of a request or a log file and gets the final result.

This will block until the request is finished. The request id can be a prefix of the full request id.

Parameters:

request_id (Optional[RequestId[TypeVar(T)]]) – The request ID of the request to stream. May be a full request ID or a prefix. If None, the latest request submitted to the API server is streamed. Using None request_id is not recommended in multi-user environments.
log_path (Optional[str]) – The path to the log file to stream.
tail (Optional[int]) – The number of lines to show from the end of the logs. If None, show all logs.
follow (bool) – Whether to follow the logs.
output_stream (Optional[TextIOBase]) – The output stream to write to. If None, print to the console.

Return type:

Optional[TypeVar(T)]

Returns:

The Request Returns of the specified request. See the documentation of the specific requests above for more details.

Raises:

Exception – It raises the same exceptions as the specific requests, see Request Raises in the documentation of the specific requests above.

`sky.api_status`#

sky.api_status(request_ids=None, all_status=False)[source]

Lists all requests.

Parameters:

request_ids (Optional[List[Union[RequestId[TypeVar(T)], str]]]) – The prefixes of the request IDs of the requests to query. If None, all requests are queried.
all_status (bool) – Whether to list all finished requests as well. This argument is ignored if request_ids is not None.

Return type:

List[RequestPayload]

Returns:

A list of request payloads.

`sky.api_cancel`#

sky.api_cancel(request_ids=None, all_users=False, silent=False)[source]

Aborts a request or all requests.

Parameters:

request_ids (Union[RequestId[TypeVar(T)], List[RequestId[TypeVar(T)]], str, List[str], None]) – The request ID(s) to abort. Can be a single string or a list of strings.
all_users (bool) – Whether to abort all requests from all users.
silent (bool) – Whether to suppress the output.

Return type:

RequestId[List[str]]

Returns:

The request ID of the abort request itself.

Request Returns:

A list of request IDs that were cancelled.

Raises:

click.BadParameter – If no request ID is specified and all_users is not set.

`sky.api_info`#

sky.api_info()[source]

Gets the server’s status, commit and version.

Return type:

APIHealthResponse

Returns:

A dictionary containing the server’s status, commit and version.

{
    'status': 'healthy',
    'api_version': '1',
    'commit': 'abc1234567890',
    'version': '1.0.0',
    'version_on_disk': '1.0.0',
    'user': {
        'name': '[email protected]',
        'id': '12345abcd',
    },
}

Note that user may be None if we are not using an auth proxy.
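A caller reading this response needs to handle the case where user is None. The sketch below assumes the response can be read like the plain dictionary shown above (the actual return type is APIHealthResponse, whose access pattern may differ). `describe_server` is a hypothetical helper and the sample values are placeholders.

```python
from typing import Any, Dict


def describe_server(info: Dict[str, Any]) -> str:
    """Build a one-line summary, tolerating a missing user."""
    user = info.get('user')
    who = user['name'] if user else 'anonymous'
    return (f"{info['status']} (version {info['version']}, "
            f"commit {info['commit'][:7]}, user {who})")


# Placeholder values shaped like the documented response.
sample_info = {
    'status': 'healthy',
    'api_version': '1',
    'commit': 'abc1234567890',
    'version': '1.0.0',
    'version_on_disk': '1.0.0',
    'user': None,  # No auth proxy in this sample.
}
print(describe_server(sample_info))
```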

`sky.api_start`#

sky.api_start(*, deploy=False, host='127.0.0.1', foreground=False, metrics=False, metrics_port=None, enable_basic_auth=False)[source]

Starts the API server.

It checks the existence of the API server and starts it if it does not exist.

Parameters:

deploy (bool) – Whether to deploy the API server, i.e. fully utilize the resources of the machine.
host (str) – The host on which to deploy the API server. It is set to 0.0.0.0 if deploy is True, to allow remote access.
foreground (bool) – Whether to run the API server in the foreground (run in the current process).
metrics (bool) – Whether to export metrics of the API server.
metrics_port (Optional[int]) – The port to export metrics of the API server.
enable_basic_auth (bool) – Whether to enable basic authentication in the API server.

Return type:

None

Returns:

None

`sky.api_stop`#

sky.api_stop()[source]

Stops the API server.

It will do nothing if the API server is remotely hosted.

Return type:: None
Returns:: None

`sky.api_server_logs`#

sky.api_server_logs(follow=True, tail=None)[source]

Streams the API server logs.

Parameters:

follow (bool) – Whether to follow the logs.
tail (Optional[int]) – The number of lines to show from the end of the logs. If None, show all logs.

Return type:

None

Returns:

None
