llama.cpp server overview and usage
Get an overview of the llama.cpp server and its usage, including configuration options and API endpoints.
Quick start
Start `llama-server` with a pre-trained model from the Hugging Face model hub. The server downloads the specified model and listens on port 8080 by default.
```sh
llama-server -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf
```
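Once the server is running, you can verify it with a quick request to the OpenAI-compatible chat endpoint. This is a minimal sketch assuming the default host and port (`127.0.0.1:8080`); the prompt is only illustrative.

```sh
# Smoke test against the default host/port.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```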
Command line options
Common options
- `-c, --ctx-size`: Size of the prompt context (default: `0`, `0` = loaded from model) (env: `LLAMA_ARG_CTX_SIZE`)
- `-n, --predict, --n-predict`: Number of tokens to predict (default: `-1`, `-1` = infinity, `-2` = until context filled) (env: `LLAMA_ARG_N_PREDICT`)
- `--keep`: Number of tokens to keep from the initial prompt (default: `0`, `-1` = all)
- `-m, --model`: Model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise `models/7B/ggml-model-f16.gguf`) (env: `LLAMA_ARG_MODEL`)
- `-mu, --model-url`: Model download URL (default: unused) (env: `LLAMA_ARG_MODEL_URL`)
- `-hfr, --hf-repo`: Hugging Face model repository (default: unused) (env: `LLAMA_ARG_HF_REPO`)
- `-hff, --hf-file`: Hugging Face model file (default: unused) (env: `LLAMA_ARG_HF_FILE`)
- `-hft, --hf-token`: Hugging Face access token (default: value from `HF_TOKEN` environment variable) (env: `HF_TOKEN`)
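As a sketch of how these options combine, the example below loads a local GGUF file with a larger context window and a cap on generated tokens; the model path is hypothetical, and the same settings can be supplied through the environment variables listed above.

```sh
# Load a local model (hypothetical path) with a 4096-token context,
# generating at most 512 tokens per request.
llama-server -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf -c 4096 -n 512

# Equivalent configuration via environment variables.
LLAMA_ARG_MODEL=models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
LLAMA_ARG_CTX_SIZE=4096 \
LLAMA_ARG_N_PREDICT=512 \
llama-server
```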
Server options
- `--host`: IP address to listen on (default: `127.0.0.1`) (env: `LLAMA_ARG_HOST`)
- `--port`: Port to listen on (default: `8080`) (env: `LLAMA_ARG_PORT`)
- `--api-key`: API key to use for authentication (default: none) (env: `LLAMA_API_KEY`)
- `--api-key-file`: Path to a file containing API keys (default: none)
- `--ssl-key-file`: Path to a file containing a PEM-encoded SSL private key (env: `LLAMA_ARG_SSL_KEY_FILE`)
- `--ssl-cert-file`: Path to a file containing a PEM-encoded SSL certificate (env: `LLAMA_ARG_SSL_CERT_FILE`)
- `-to, --timeout`: Server read/write timeout in seconds (default: `600`) (env: `LLAMA_ARG_TIMEOUT`)
- `--metrics`: Enable Prometheus-compatible metrics endpoint (default: disabled) (env: `LLAMA_ARG_ENDPOINT_METRICS`)
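As an example, the sketch below exposes the server on all interfaces with an API key and the metrics endpoint enabled; the key value is a placeholder, and clients send it as a Bearer token.

```sh
# Listen on all interfaces, require an API key, and enable metrics.
llama-server -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  --api-key my-secret-key --metrics

# Authenticated request using the placeholder key.
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer my-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}]}'
```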
Default parameters
- Seed: `-1` (random). The seed value for generating random numbers.
- Temperature: `0.8`. Controls the randomness of the output: lower values produce more conservative results, higher values produce more diverse results.
- Top-k: `40`. The number of highest-probability tokens to consider when generating output.
- Top-p: `0.9`. The cumulative probability threshold for the top tokens to consider when generating output.
- Min-p: `0.1`. The minimum probability threshold for a token to be considered in the output.
- Presence penalty: `0.0` (disabled). A penalty applied to tokens that are already present in the output.
- Frequency penalty: `0.0` (disabled). A penalty applied to tokens that are frequently used in the output.
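These defaults can be overridden per request. The sketch below sends custom sampling parameters through the OpenAI-compatible chat endpoint; the values are illustrative rather than recommendations.

```sh
# Override sampling defaults for a single request (illustrative values).
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Name three planets."}],
        "temperature": 0.2,
        "top_p": 0.95,
        "seed": 42
      }'
```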
Endpoints
`GET /health`: Health check

This endpoint checks the health status of the service.

- Success Response
  - Code: `200`
  - Example: `{"status": "ok"}`
- Error Response
  - Code: `503`
  - Example: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
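A minimal check from the command line, assuming the default host and port:

```sh
# Returns HTTP 200 with {"status": "ok"} once the model has finished loading.
curl -i http://127.0.0.1:8080/health
```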
`POST /v1/chat/completions`: OpenAI-compatible chat completions
This endpoint generates text completions in response to user input, compatible with the OpenAI API.
```python
import openai

# The server ignores the API key unless --api-key is set, but the client requires a non-empty value.
client = openai.OpenAI(base_url='http://llama-cpp:8080/v1', api_key='dummy')

response = client.chat.completions.create(
    model='dummy',
    messages=[
        {'role': 'system', 'content': 'You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.'},
        {'role': 'user', 'content': 'Write a limerick about python exceptions'},
    ],
    stream=True,
)

# Print the generated tokens as they arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or '', end='', flush=True)
```
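With `stream=True`, tokens are printed as they arrive; setting `stream=False` instead returns a single response object whose full text is available at `response.choices[0].message.content`.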