# Running Llama Stack: A step-by-step guide
Follow this step-by-step guide to run Llama Stack on your local machine, including downloading models, cloning the repository, and starting the server.
## Distribution options
There are several options for running inference on the underlying Llama model, depending on available hardware and resources:
- For machines with powerful GPUs (see the quick check after this list):
  - meta-reference-gpu
  - tgi
- For regular desktop machines:
  - ollama
- For users with an API key for remote inference providers:
  - together
  - fireworks
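If you're not sure whether your machine falls into the "powerful GPU" category, a quick check like the following (assuming an NVIDIA GPU with drivers installed) reports the GPU model and available memory:

```sh
# Check for an NVIDIA GPU and how much VRAM it has (requires NVIDIA drivers).
nvidia-smi --query-gpu=name,memory.total --format=csv
```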
## Example: meta-reference-gpu distribution
Here is a step-by-step guide to set up Llama Stack for the "meta-reference-gpu" distribution:
### Prerequisites
- Git installed on your system.
- Docker installed on your system.
- Access to a single-node GPU to start a local server.
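Before going further, you can confirm that the prerequisite tooling is installed and available on your PATH:

```sh
# Verify Git and Docker (with the Compose plugin) are installed.
git --version
docker --version
docker compose version
```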
### Download models
Before proceeding, you need to download the Llama model checkpoints to the `~/.llama` directory. You can do this by following the installation guide. Once you have downloaded the models, you should see a list of model checkpoints in the `~/.llama/checkpoints` directory, similar to the following:
```sh
$ ls ~/.llama/checkpoints
Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
```
### Clone the repository
Open a terminal and run the following command to clone the Llama Stack repository:
```sh
git clone git@github.com:meta-llama/llama-stack.git
```
This will download the Llama Stack repository to your local machine.
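If you don't have SSH keys set up with GitHub, cloning over HTTPS works just as well:

```sh
# HTTPS clone (no SSH key required).
git clone https://github.com/meta-llama/llama-stack.git
```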
### Start Docker containers
Change into the `distributions/meta-reference-gpu` directory within the cloned repository:
```sh
cd llama-stack/distributions/meta-reference-gpu
```
Run the following command to start the "meta-reference-gpu" distribution using Docker Compose:
```sh
docker compose up
```
This command pulls the pre-built Docker image for the "meta-reference-gpu" distribution and starts the container.
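If you'd rather keep the container running in the background, Docker Compose can start it detached and stream the logs separately:

```sh
# Start the distribution in the background and follow its logs.
docker compose up -d
docker compose logs -f
```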
### Updating the model serving configuration
If you want to update the model currently being served by the distribution, modify `config.model` in the `run.yaml` file. Make sure you have the corresponding model checkpoint downloaded in `~/.llama`.

You can list the available models with `llama model list` and download one with `llama model download <model_name>`. Then set `config.model` in the `run.yaml` file to the name of the model you want to serve.
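For example, using the commands above, you could switch the distribution to a smaller checkpoint. The model name here is just an illustration; pick any model from `llama model list` that you have downloaded:

```sh
# See which models are available to download.
llama model list

# Download the checkpoint into ~/.llama (model name is an example).
llama model download Llama3.2-3B-Instruct

# Then set config.model in run.yaml to the same name, e.g.:
#   config:
#     model: Llama3.2-3B-Instruct
```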
### Send a request to the server
Here's how to send a chat completion request to the server.
Run the following command:
```sh
curl http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
  }'
```
This will send the request to the server's `/inference/chat_completion` API. The response from the server will be displayed in the terminal and should look something like this:
```json
{
  "completion_message": {
    "role": "assistant",
    "content": "The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
```
This response indicates that the server has successfully processed the chat completion request and returned a generated poem.
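If you only care about the generated text, the same request can be piped through jq (assuming jq is installed) to pull out just the assistant's reply, using the `completion_message.content` field shown in the response above:

```sh
# Same request as above, but print only the assistant's reply.
curl -s http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
  }' | jq -r '.completion_message.content'
```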