llama.cpp server overview and usage
Get an overview of the llama.cpp server and its usage, including configuration options and API endpoints.
Quick start
Start `llama-server` with a pre-trained model from the Hugging Face model hub. The server downloads the specified model and listens on port 8080 by default.
```sh
llama-server -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf
```
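Once the server is running, you can verify it with a quick request to the OpenAI-compatible chat endpoint. This is a minimal sketch assuming the default host and port (`127.0.0.1:8080`); the prompt is only illustrative.

```sh
# Smoke test against the default host/port.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```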
Command line options
Common options
- `-c, --ctx-size`: Size of the prompt context (default: `0`, `0` = loaded from model) (env: `LLAMA_ARG_CTX_SIZE`)
- `-n, --predict, --n-predict`: Number of tokens to predict (default: `-1`, `-1` = infinity, `-2` = until context filled) (env: `LLAMA_ARG_N_PREDICT`)
- `--keep`: Number of tokens to keep from the initial prompt (default: `0`, `-1` = all)
- `-m, --model`: Model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise `models/7B/ggml-model-f16.gguf`) (env: `LLAMA_ARG_MODEL`)
- `-mu, --model-url`: Model download URL (default: unused) (env: `LLAMA_ARG_MODEL_URL`)
- `-hfr, --hf-repo`: Hugging Face model repository (default: unused) (env: `LLAMA_ARG_HF_REPO`)
- `-hff, --hf-file`: Hugging Face model file (default: unused) (env: `LLAMA_ARG_HF_FILE`)
- `-hft, --hf-token`: Hugging Face access token (default: value from `HF_TOKEN` environment variable) (env: `HF_TOKEN`)
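As a sketch of how these options combine, the example below loads a local GGUF file with a larger context window and a cap on generated tokens; the model path is hypothetical, and the same settings can be supplied through the environment variables listed above.

```sh
# Load a local model (hypothetical path) with a 4096-token context,
# generating at most 512 tokens per request.
llama-server -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -c 4096 -n 512

# Equivalent configuration via environment variables.
LLAMA_ARG_MODEL=models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
LLAMA_ARG_CTX_SIZE=4096 \
LLAMA_ARG_N_PREDICT=512 \
llama-server
```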
Server options
- `--host`: IP address to listen on (default: `127.0.0.1`) (env: `LLAMA_ARG_HOST`)
- `--port`: Port to listen on (default: `8080`) (env: `LLAMA_ARG_PORT`)
- `--api-key`: API key to use for authentication (default: none) (env: `LLAMA_API_KEY`)
- `--api-key-file`: Path to a file containing API keys (default: none)
- `--ssl-key-file`: Path to a file containing a PEM-encoded SSL private key (env: `LLAMA_ARG_SSL_KEY_FILE`)
- `--ssl-cert-file`: Path to a file containing a PEM-encoded SSL certificate (env: `LLAMA_ARG_SSL_CERT_FILE`)
- `-to, --timeout`: Server read/write timeout in seconds (default: `600`) (env: `LLAMA_ARG_TIMEOUT`)
- `--metrics`: Enable Prometheus-compatible metrics endpoint (default: disabled) (env: `LLAMA_ARG_ENDPOINT_METRICS`)
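As an example, the sketch below exposes the server on all interfaces with an API key and the metrics endpoint enabled; the key value is a placeholder, and clients send it as a Bearer token.

```sh
# Listen on all interfaces, require an API key, and enable metrics.
llama-server -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --api-key my-secret-key --metrics

# Authenticated request using the placeholder key.
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}]}'
```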
Default parameters
- Seed: `-1` (random). The seed value for generating random numbers.
- Temperature: `0.8`. Controls the randomness of the output: lower values produce more conservative results, higher values produce more diverse results.
- Top-k: `40`. The number of highest-probability tokens to consider when generating output.
- Top-p: `0.9`. The cumulative probability threshold for the top tokens to consider when generating output.
- Min-p: `0.1`. The minimum probability threshold for a token to be considered in the output.
- Presence penalty: `0.0` (disabled). A penalty applied to tokens that are already present in the output.
- Frequency penalty: `0.0` (disabled). A penalty applied to tokens that are frequently used in the output.
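These defaults can be overridden per request. The sketch below sends custom sampling parameters through the OpenAI-compatible chat endpoint; the values are illustrative rather than recommendations.

```sh
# Override sampling defaults for a single request (illustrative values).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Name three planets."}],
        "temperature": 0.2,
        "top_p": 0.95,
        "seed": 42
      }'
```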
Endpoints
`GET /health`: Health check

This endpoint checks the health status of the service.

- Success Response
  - Code: `200`
  - Example: `{"status": "ok"}`
- Error Response
  - Code: `503`
  - Example: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
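A minimal check from the command line, assuming the default host and port:

```sh
# Returns HTTP 200 with {"status": "ok"} once the model has finished loading.
curl -i http://127.0.0.1:8080/health
```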
`POST /v1/chat/completions`: OpenAI-compatible chat completions
This endpoint generates text completions in response to user input, compatible with the OpenAI API.
```python
import openai

# The server ignores the API key unless --api-key is set, but the client requires a non-empty value.
client = openai.OpenAI(base_url='http://llama-cpp:8080/v1', api_key='dummy')

response = client.chat.completions.create(
    model='dummy',
    messages=[
        {'role': 'system', 'content': 'You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.'},
        {'role': 'user', 'content': 'Write a limerick about python exceptions'},
    ],
    stream=True,
)

# Print the generated tokens as they arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or '', end='', flush=True)
```
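With `stream=True`, tokens are printed as they arrive; setting `stream=False` instead returns a single response object whose full text is available at `response.choices[0].message.content`.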