Serving User Guides

  • Autoscaling
    • Fixed replicas
    • Enabling autoscaling
    • Scaling delay
    • Scale-to-zero
  • Updating a Service
    • Rolling update
      • Example
    • Blue-green update
      • Example
  • Authorization
    • Set up API keys
  • Using Spot Instances for Serving
    • Base on-demand fallback
    • Dynamic on-demand fallback
    • Example
  • HTTPS Encryption
    • HTTPS encrypted endpoint

© Copyright 2025, SkyPilot Team.