Running Llama Stack: A step-by-step guide

Follow this step-by-step guide to run Llama Stack on your local machine, including downloading models, cloning the repository, and starting the server.

Distribution options

There are several options for running inference on the underlying Llama model, depending on available hardware and resources:

  1. For machines with powerful GPUs:
    • meta-reference-gpu
    • tgi
  2. For regular desktop machines:
    • ollama
  3. For users with an API key for remote inference providers:
    • together
    • fireworks

Example: meta-reference-gpu distribution

Here is a step-by-step guide to set up Llama Stack for the "meta-reference-gpu" distribution:

Prerequisites

  • Make sure you have Git installed on your system.
  • You need to have access to a single-node GPU to start a local server.
  • You need to have Docker installed on your system.
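
As an optional sanity check, you can confirm that the required tooling is visible from your shell; nvidia-smi applies only if you are using an NVIDIA GPU:

```sh
# Optional checks for the prerequisites above.
git --version             # Git is installed
docker --version          # Docker CLI is installed
docker compose version    # Compose plugin used later in this guide
nvidia-smi                # on NVIDIA systems, lists the GPUs the container can use
```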

Download models

Before proceeding, you need to download the Llama model checkpoints to the ~/.llama directory. You can do this by following the installation guide.
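
If the llama CLI from the installation guide is on your PATH, downloading a checkpoint generally looks like the sketch below; the exact options may differ between versions, so treat it as an outline rather than the authoritative syntax:

```sh
# List the models available for download, then fetch the one you want to serve.
llama model list
llama model download <model_name>   # e.g. Llama3.1-8B-Instruct
```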

Once you have downloaded the models, you should see a list of model checkpoints in the ~/.llama/checkpoints directory, similar to the following:

```sh
$ ls ~/.llama/checkpoints
Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
```

Clone the repository

Open a terminal and run the following command to clone the Llama Stack repository:

```sh
git clone git@github.com:meta-llama/llama-stack.git
```

This will download the Llama Stack repository to your local machine.
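
The command above clones over SSH. If you do not have SSH keys set up with GitHub, cloning over HTTPS works as well:

```sh
# HTTPS clone of the same repository (no SSH key required).
git clone https://github.com/meta-llama/llama-stack.git
```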

Start Docker containers

Change into the meta-reference-gpu distribution directory within the cloned repository:

```sh
cd llama-stack/distributions/meta-reference-gpu
```

Run the following command to start the "meta-reference-gpu" distribution using Docker Compose:

```sh
docker compose up
```

This command will download and start running a pre-built Docker container for the "meta-reference-gpu" distribution.
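
To confirm the distribution came up cleanly, you can inspect the container from a second terminal in the same directory. These are standard Docker Compose commands, not anything specific to Llama Stack:

```sh
# Check that the container is running and follow its logs while the model loads.
docker compose ps
docker compose logs -f
```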

Updating the model serving configuration

If you want to change the model served by the distribution, update config.model in the run.yaml file to the name of the model you want to serve, and make sure that model's checkpoint has been downloaded to ~/.llama.

You can list the models available for download with the command llama model list, and download one with the command llama model download <model_name>.
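
Putting those pieces together, a typical model switch might look like the sketch below. The config.model key comes from the description above; the restart step is an assumption, since the container reads run.yaml when it starts:

```sh
# Hypothetical workflow for switching the served model.
llama model download <model_name>   # make sure the checkpoint exists under ~/.llama
$EDITOR run.yaml                    # set config.model to <model_name>
docker compose up                   # restart the distribution so it picks up the change
```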

Send a request to the server

Here is how to send a chat completion request to the server.

Run the following command:

```sh
curl http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
      "model": "Llama3.1-8B-Instruct",
      "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
      ],
      "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
  }'
```

This will send the request to the server's /inference/chat_completion API. The response from the server will be displayed in the terminal. It should look something like this:

```json
{
  "completion_message": {
    "role": "assistant",
    "content": "The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
```

This response indicates that the server has successfully processed the chat completion request and returned a generated poem.
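
If you have jq installed, you can extract just the generated text from the response; the jq path below matches the field names in the example response above:

```sh
# Same request as above, piped through jq to print only the assistant's reply.
curl -s http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
      "model": "Llama3.1-8B-Instruct",
      "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
      ],
      "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
  }' | jq -r '.completion_message.content'
```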