TGI: Hugging Face Text Generation Inference
Text Generation Inference (TGI) is a toolkit from Hugging Face for deploying and serving Large Language Models (LLMs) for text generation.
Launch a Single-Instance TGI Serving
We can host the model on a single instance using the service YAML:
sky launch -c tgi serve.yaml
A user can access the model with the following command:
ENDPOINT=$(sky status --endpoint 8080 tgi)
curl $ENDPOINT/generate \
-H 'Content-Type: application/json' \
-d '{
"inputs": "What is Deep Learning?",
"parameters": {
"max_new_tokens": 20
}
}'
The output should be similar to the following:
{
"generated_text": "What is Deep Learning? Deep Learning is a subfield of machine learning that is concerned with algorithms inspired by the structure and function of the brain called artificial neural networks."
}
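To send different prompts without editing the JSON by hand, the request body can be generated by a small shell function. The following is a minimal sketch, not part of SkyPilot or TGI: the helper name `tgi_payload` is hypothetical, and it assumes the prompt contains no double quotes, backslashes, or newlines, because it performs no JSON escaping.

```shell
# Hypothetical helper (not provided by SkyPilot or TGI):
# print a JSON request body for the /generate route.
# Assumption: the prompt has no double quotes, backslashes, or newlines,
# since this sketch does no JSON escaping.
tgi_payload() {
  prompt=$1
  max_new_tokens=${2:-20}   # default to 20 new tokens, as in the example above
  printf '{"inputs": "%s", "parameters": {"max_new_tokens": %d}}' \
    "$prompt" "$max_new_tokens"
}

# Usage, assuming ENDPOINT was set with `sky status --endpoint 8080 tgi`:
# curl "$ENDPOINT/generate" \
#   -H 'Content-Type: application/json' \
#   -d "$(tgi_payload 'What is Deep Learning?' 20)"
```

For prompts that may contain arbitrary characters, a JSON-aware tool is a safer way to build the body than `printf`.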
Scale the Serving with SkyPilot Serve
Using the same YAML, we can easily scale the model serving across multiple instances, regions and clouds with SkyServe:
sky serve up -n tgi serve.yaml
After the service is launched, we can access the model with the following command:
ENDPOINT=$(sky serve status --endpoint tgi)
curl $ENDPOINT/generate \
-H 'Content-Type: application/json' \
-d '{
"inputs": "What is Deep Learning?",
"parameters": {
"max_new_tokens": 20
}
}'
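A freshly launched replica may need some time to download and load the model before it answers requests, so the first call can fail. One way to handle this is to retry until the server responds. The sketch below is an illustration under stated assumptions: `retry` is a hypothetical helper, and the usage comment assumes the TGI server exposes a `/health` route at the endpoint.

```shell
# Hypothetical helper: run a command up to $1 times,
# sleeping $2 seconds between failed attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage, assuming ENDPOINT is set as above and /health is available:
# retry 30 10 curl -sf "$ENDPOINT/health" && echo "service is ready"
```

The attempt count and delay are arbitrary here; adjust them to the model size and the time the replicas take to start.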