
llama-cpp

llama.cpp server overview and usage

Get an overview of the llama.cpp server and its usage, including configuration options and API endpoints.

Quick start

Start llama-server with a pre-trained model from the Hugging Face Hub. The server downloads the specified model and listens on port 8080 by default.

sh
llama-server -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf

Command line options

Common options

  • -c, --ctx-size: Size of the prompt context (default: 0, 0 = loaded from model)
    (env: LLAMA_ARG_CTX_SIZE)
  • -n, --predict, --n-predict: Number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
    (env: LLAMA_ARG_N_PREDICT)
  • --keep: Number of tokens to keep from the initial prompt (default: 0, -1 = all)
  • -m, --model: Model path (default: models/$filename with filename from --hf-file or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
    (env: LLAMA_ARG_MODEL)
  • -mu, --model-url: Model download url (default: unused)
    (env: LLAMA_ARG_MODEL_URL)
  • -hfr, --hf-repo: Hugging Face model repository (default: unused)
    (env: LLAMA_ARG_HF_REPO)
  • -hff, --hf-file: Hugging Face model file (default: unused)
    (env: LLAMA_ARG_HF_FILE)
  • -hft, --hf-token: Hugging Face access token (default: value from HF_TOKEN environment variable)
    (env: HF_TOKEN)
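
Most of these options can also be set through the environment variables listed above, which is convenient when the command line is hard to change (for example in a container). A minimal sketch, equivalent to the quick start plus an explicit context size and prediction limit (the numeric values are arbitrary examples):

sh
# Same model as the quick start, configured via environment variables
LLAMA_ARG_HF_REPO=lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
LLAMA_ARG_HF_FILE=Llama-3.2-1B-Instruct-Q4_K_M.gguf \
LLAMA_ARG_CTX_SIZE=4096 \
LLAMA_ARG_N_PREDICT=512 \
llama-server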

Server options

  • --host: IP address to listen on (default: 127.0.0.1)
    (env: LLAMA_ARG_HOST)
  • --port: Port to listen on (default: 8080)
    (env: LLAMA_ARG_PORT)
  • --api-key: API key to use for authentication (default: none)
    (env: LLAMA_API_KEY)
  • --api-key-file: Path to file containing API keys (default: none)
  • --ssl-key-file: Path to a file containing a PEM-encoded SSL private key
    (env: LLAMA_ARG_SSL_KEY_FILE)
  • --ssl-cert-file: Path to a file containing a PEM-encoded SSL certificate
    (env: LLAMA_ARG_SSL_CERT_FILE)
  • -to, --timeout: Server read/write timeout in seconds (default: 600)
    (env: LLAMA_ARG_TIMEOUT)
  • --metrics: Enable the Prometheus-compatible metrics endpoint (default: disabled)
    (env: LLAMA_ARG_ENDPOINT_METRICS)
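
As an example of combining these flags, the sketch below exposes the server on all interfaces, requires an API key, and enables the Prometheus metrics endpoint (the key value is a placeholder):

sh
llama-server \
  -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key my-secret-key \
  --metrics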

Default parameters

  • Seed: -1 (random) - The seed value for generating random numbers.
  • Temperature: 0.8 - A value that controls the randomness of the output. Lower values produce more conservative results, while higher values produce more diverse results.
  • Top-k: 40 - The number of highest-probability tokens to consider when generating output.
  • Top-p: 0.9 - The cumulative probability threshold for the top tokens to consider when generating output.
  • Min-p: 0.1 - The minimum probability threshold for a token to be considered in the output.
  • Presence penalty: 0.0 (disabled) - A penalty applied to tokens that are already present in the output.
  • Frequency penalty: 0.0 (disabled) - A penalty applied to tokens that are frequently used in the output.
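
These defaults apply when a request does not set its own sampling parameters; individual requests can override them. A sketch using curl against a local server (temperature, top_p and seed are standard OpenAI request fields; the values shown are arbitrary examples):

sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dummy",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "temperature": 0.2,
    "top_p": 0.95,
    "seed": 42
  }'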

Endpoints

GET /health: Health check

This endpoint checks the health status of the service.

  • Success Response
    • Code: 200
    • Example: {"status": "ok" }
  • Error Response
    • Code: 503
    • Example: {"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}

POST /v1/chat/completions: OpenAI compatible chat completions

This endpoint generates text completions in response to user input, compatible with the OpenAI API.

python
import openai

# Point the OpenAI client at the llama.cpp server. The client requires an API
# key value, but the server only checks it when started with --api-key.
client = openai.OpenAI(base_url='http://llama-cpp:8080/v1', api_key='dummy')

response = client.chat.completions.create(
  model='dummy',  # the server serves whichever model it loaded, so the name is arbitrary
  messages=[
    {'role': 'system', 'content': 'You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.'},
    {'role': 'user', 'content': 'Write a limerick about python exceptions'},
  ],
  stream=True,
)

# Print streamed tokens as they arrive.
for chunk in response:
  print(chunk.choices[0].delta.content or '', end='', flush=True)
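
The same endpoint can also be called without the openai client. A curl sketch of a non-streaming version of the request above, using the same hostname as the Python example (if the server was started with --api-key, add an "Authorization: Bearer <key>" header):

sh
curl http://llama-cpp:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dummy",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
  }'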