

Getting started with llama.cpp CLI

Learn how to use the llama.cpp CLI to generate text and engage in conversations with AI models. Get started with our quickstart guide and examples.

Quick start

Basic model invocation

The following command demonstrates how to invoke a model and generate text with llama-cli. It loads the model from a Hugging Face repository and uses it to generate a continuation of the prompt "Once upon a time".

```sh
llama-cli \
  -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -p "Once upon a time"
```

Interactive conversation mode

To engage in a conversation with the model, use the following command. It loads the same model as before, but enables conversation mode and sets the system prompt to "You are a helpful assistant."

The -cnv flag puts llama-cli into conversation mode, which also enables interactive input, so the model responds to each message you type.

```sh
llama-cli \
  -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -p "You are a helpful assistant." \
  -cnv
```

Command line options

Common options

These options configure the basic behavior of llama-cli and llama-server: the input prompt, model selection, and output settings. A combined example follows the list.

  • -c, --ctx-size: Size of the prompt context (default: 0, 0 = loaded from model)
  • -n, --predict, --n-predict: Number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
    (env: LLAMA_ARG_N_PREDICT)
  • -p, --prompt: Prompt to start generation with
    If -cnv is set, this will be used as the system prompt
  • -f, --file: A file containing the prompt (default: none)
  • -m, --model: Model path (default: models/$filename with filename from --hf-file or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
    (env: LLAMA_ARG_MODEL)
  • -mu, --model-url: Model download url (default: unused)
    (env: LLAMA_ARG_MODEL_URL)
  • -hfr, --hf-repo: Hugging Face model repository (default: unused)
    (env: LLAMA_ARG_HF_REPO)
  • -hff, --hf-file: Hugging Face model file (default: unused)
    (env: LLAMA_ARG_HF_FILE)
  • -hft, --hf-token: Hugging Face access token (default: value from HF_TOKEN environment variable)
    (env: HF_TOKEN)
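
As a rough illustration of how these options combine, the sketch below points -m at a local GGUF file (the path is a placeholder), sets a 4096-token context, and caps generation at 128 tokens; the second command shows the equivalent model selection through the documented LLAMA_ARG_MODEL environment variable.

```sh
# Placeholder path: point -m at wherever your .gguf file actually lives.
llama-cli \
  -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -c 4096 \
  -n 128 \
  -p "Explain what a context window is in one paragraph."

# The same model can be selected via the documented environment variable.
LLAMA_ARG_MODEL=models/Llama-3.2-1B-Instruct-Q4_K_M.gguf llama-cli -n 128 -p "Hello"
```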

Sampling options

These options control the sampling strategy the model uses when generating text, letting you trade predictability against variety. A combined example follows the list.

  • -s, --seed: RNG seed (default: -1, -1 = random seed)
  • --temp: Temperature (default: 0.8)
  • --top-k: Top-K sampling (default: 40, 0 = disabled)
  • --top-p: Top-P sampling (default: 0.9, 1.0 = disabled)
  • --min-p: Min-P sampling (default: 0.1, 0.0 = disabled)
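
For instance, the following sketch reuses the quick-start model and tightens sampling for more repeatable output. The specific values (a cooler temperature, a fixed seed, a 64-token limit) are illustrative, not recommendations.

```sh
# Lower temperature plus a fixed seed makes runs more repeatable.
llama-cli \
  -hfr lmstudio-community/Llama-3.2-1B-Instruct-GGUF \
  -hff Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -p "Write a haiku about mountains." \
  -n 64 \
  --temp 0.4 \
  --top-k 40 \
  --top-p 0.9 \
  --min-p 0.05 \
  -s 42
```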

Interaction options

These options control the interactive behavior of llama-cli, letting you customize the conversation flow and input handling. A combined example appears after the note below.

  • -r, --reverse-prompt: Halt generation at a specific prompt, return control in interactive mode
  • -cnv, --conversation: Run in conversation mode (default: false)
    • Special tokens and the suffix/prefix are not printed
    • Interactive mode is also enabled
  • -i, --interactive: Run in interactive mode (default: false)
  • -if, --interactive-first: Run in interactive mode and wait for input immediately (default: false)
  • --in-prefix: String to prepend to user inputs (default: empty)
  • --in-suffix: String to append after user inputs (default: empty)
  • --chat-template: Set a custom Jinja chat template (default: template taken from model's metadata)
    (env: LLAMA_ARG_CHAT_TEMPLATE)

    NOTE

    If a prefix/suffix is specified, the chat template is disabled. Only commonly used templates are accepted; see https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
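
As an example of these flags working together, the sketch below runs a plain interactive session: the model path is a placeholder, and the reverse prompt and prefix/suffix values are illustrative. As the note above states, setting a prefix/suffix means no chat template is applied.

```sh
# Placeholder model path; "User:" / "Assistant:" are example markers,
# and specifying --in-prefix/--in-suffix disables any chat template.
llama-cli \
  -m models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -p "A chat between a curious user and a helpful assistant." \
  -i \
  -r "User:" \
  --in-prefix " " \
  --in-suffix "Assistant:"
```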