Getting started with llama.cpp CLI
Learn how to use the llama.cpp CLI to generate text and engage in conversations with AI models. Get started with our quickstart guide and examples.
Quick start
Basic model invocation
The following command demonstrates how to invoke a model and generate text using `llama-cli`. It loads a specific model from a Hugging Face repository and uses it to generate text starting from the prompt "Once upon a time".
```sh
llama-cli \
-hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
-hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "Once upon a time"
```
Interactive conversation mode
To engage in a conversation with the model, you can use the following command. It loads the same model as before, but enables conversation mode and sets the system prompt to "You are a helpful assistant."
The `-cnv` flag tells `llama-cli` to enter interactive mode, where it will respond to your input.
```sh
llama-cli \
-hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
-hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "You are a helpful assistant." \
-cnv
```
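If the chat template stored in the model's metadata is not picked up the way you expect, you can force one of the commonly used templates with `--chat-template` (see Interaction options below). A sketch, assuming the built-in `llama3` template name is the right one for this model:

```sh
# Conversation mode with an explicitly chosen chat template
# (the "llama3" template name is an assumption; see the list of supported templates)
llama-cli \
-hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
-hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "You are a helpful assistant." \
-cnv \
--chat-template llama3
```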
Command line options
Common options
These options are used to configure the basic behavior of `llama-cli` and `llama-server`. They control the input prompt, model selection, and output settings.
- `-c`, `--ctx-size`: Size of the prompt context (default: 0, 0 = loaded from model)
- `-n`, `--predict`, `--n-predict`: Number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled) (env: `LLAMA_ARG_N_PREDICT`)
- `-p`, `--prompt`: Prompt to start generation with. If `-cnv` is set, this is used as the system prompt.
- `-f`, `--file`: A file containing the prompt (default: none)
- `-m`, `--model`: Model path (default: `models/$filename` with filename from `--hf-file` or `--model-url` if set, otherwise `models/7B/ggml-model-f16.gguf`) (env: `LLAMA_ARG_MODEL`)
- `-mu`, `--model-url`: Model download URL (default: unused) (env: `LLAMA_ARG_MODEL_URL`)
- `-hfr`, `--hf-repo`: Hugging Face model repository (default: unused) (env: `LLAMA_ARG_HF_REPO`)
- `-hff`, `--hf-file`: Hugging Face model file (default: unused) (env: `LLAMA_ARG_HF_FILE`)
- `-hft`, `--hf-token`: Hugging Face access token (default: value from the `HF_TOKEN` environment variable) (env: `HF_TOKEN`)
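As a sketch of how these combine, the command below assumes you already have a GGUF file on disk (the path is illustrative) and a local `prompt.txt`, sets an explicit 4096-token context, and caps generation at 256 tokens:

```sh
# Load a local model file, read the prompt from prompt.txt,
# use a 4096-token context, and generate at most 256 tokens
llama-cli \
-m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-c 4096 \
-n 256 \
-f prompt.txt
```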
Sampling options
These options control the sampling strategy used by the model to generate text. They allow you to fine-tune the output to suit your specific needs.
- `-s`, `--seed`: RNG seed (default: -1, use a random seed for -1)
- `--temp`: Temperature (default: 0.8)
- `--top-k`: Top-K sampling (default: 40, 0 = disabled)
- `--top-p`: Top-P sampling (default: 0.9, 1.0 = disabled)
- `--min-p`: Min-P sampling (default: 0.1, 0.0 = disabled)
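For example, the following sketch reuses the quick-start model but pins the seed and lowers the temperature for more repeatable, less random output (the values are illustrative, not recommendations):

```sh
# Fixed seed plus lower temperature and tighter top-p for more deterministic output
llama-cli \
-hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
-hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "Once upon a time" \
-s 42 \
--temp 0.2 \
--top-p 0.5
```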
Interaction options
These options control the interactive behavior of `llama-cli`. They allow you to customize the conversation flow and input handling.
- `-r`, `--reverse-prompt`: Halt generation at a specific prompt and return control in interactive mode
- `-cnv`, `--conversation`: Run in conversation mode (default: false)
  - Does not print special tokens or the suffix/prefix
  - Interactive mode is also enabled
- `-i`, `--interactive`: Run in interactive mode (default: false)
- `-if`, `--interactive-first`: Run in interactive mode and wait for input immediately (default: false)
- `--in-prefix`: String to prefix user inputs with (default: empty)
- `--in-suffix`: String to suffix user inputs with (default: empty)
- `--chat-template`: Set a custom Jinja chat template (default: template taken from the model's metadata) (env: `LLAMA_ARG_CHAT_TEMPLATE`)

NOTE: If a suffix/prefix is specified, the chat template is disabled. Only commonly used templates are accepted: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
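A minimal sketch combining these options: start in interactive mode and take input immediately, hand control back whenever the model emits "User:", and wrap your typed input with a prefix and suffix (the model path, prompt, and marker strings are illustrative; as noted above, setting a prefix/suffix disables the chat template):

```sh
# Interactive session driven by a reverse prompt instead of a chat template
llama-cli \
-m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-p "A chat between User and Assistant." \
-if \
-r "User:" \
--in-prefix " " \
--in-suffix "Assistant:"
```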