Quick guide to run llama.cpp with Docker
Discover how to quickly set up and run llama.cpp models using Docker. This guide covers interactive mode, server deployment, and essential command options for seamless integration.
Quick Start
Running a model in interactive mode
To run a language model interactively using Docker, use the command below. It launches a Docker container with the specified image and mounts a local directory containing your models. Here's a breakdown of the key flags:
- `--hf-repo`: Indicates the Hugging Face repository to download the model from. This is useful when you want to use a model that's not already in your local directory.
- `--hf-file`: Specifies the exact file to download from the Hugging Face repository.
- `-m`: Specifies the model file to use.
- `-p`: Provides the prompt for text generation.
- `-n`: Sets the maximum number of tokens to generate.
```sh
docker run -it -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light \
  --hf-repo lmstudio-community/gemma-2-2b-it-GGUF \
  --hf-file gemma-2-2b-it-Q4_K_M.gguf \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512
```
In this example, the command will attempt to use the `gemma-2-2b.gguf` model from the local `/models` directory. If it's not found, it will download the `gemma-2-2b-it-Q4_K_M.gguf` file from the `lmstudio-community/gemma-2-2b-it-GGUF` Hugging Face repository. This feature allows for easy access to various models without manually downloading them beforehand.
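If the model file is already present in the mounted directory, the Hugging Face flags can be dropped entirely. A minimal local-only variant, assuming `/path/to/models/gemma-2-2b.gguf` already exists on the host:

```sh
# Run against a model that is already on disk; no Hugging Face download involved
docker run -it -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512
```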
Running a model as a server
For hosting the language model as a server, use the following command. This setup makes the model accessible over a network by mapping port `8000` of the container to port `8000` on your host machine. Once the server is running, you can send requests to it.
```sh
docker run -it -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-2b.gguf --port 8000 --host 0.0.0.0
```
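Once the server is up, you can test it with a simple HTTP request. A minimal example using `llama-server`'s `/completion` endpoint, where `prompt` is the input text and `n_predict` limits the number of generated tokens:

```sh
# Send a completion request to the containerized server on the mapped port
curl http://localhost:8000/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why is the sky blue?", "n_predict": 128}'
```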
Docker image repository
The Docker images for running llama.cpp models are hosted on the GitHub Container Registry (GHCR) under the repository `ghcr.io/ggerganov/llama.cpp`. For further details on usage, refer to the links below.
- https://github.com/ggerganov/llama.cpp/pkgs/container/llama.cpp
- https://github.com/ggerganov/llama.cpp/blob/master/docs/docker.md
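For example, a specific tag can be pulled ahead of time:

```sh
docker pull ghcr.io/ggerganov/llama.cpp:light
```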
Available Docker images
Several Docker images are available, each designed for specific use cases:
- `light`: Contains only `llama-cli`, suitable for lightweight interactive usage.
- `server`: Includes `llama-server`, ideal for running models as a server.
- `full`: A comprehensive image that encompasses `llama-cli`, `llama-server`, `llama-quantize`, and `convert_hf_to_gguf.py` (see the quantization example below).
Supported architectures include `amd64` and `arm64`.
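As an illustration of the `full` image, the sketch below quantizes a GGUF model with the bundled `llama-quantize` tool. It assumes the image's entrypoint accepts a `--quantize` subcommand as described in the upstream Docker documentation, and it uses placeholder file names; adjust the paths and quantization type for your model:

```sh
# Hypothetical file names; the quantized output is written into the mounted /models directory
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full \
  --quantize /models/gemma-2-2b-f16.gguf /models/gemma-2-2b-Q4_K_M.gguf Q4_K_M
```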
CUDA support
For users with NVIDIA GPUs seeking hardware acceleration, an image variant optimized for CUDA is available.
- Image variant: `-cuda`
- Supported architecture: `amd64`
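A sketch of GPU-accelerated interactive usage, assuming the suffix is appended to a base tag (e.g. `light-cuda`) and that the NVIDIA Container Toolkit is installed so Docker can pass GPUs through with `--gpus`:

```sh
# --n-gpu-layers controls how many model layers are offloaded to the GPU
docker run -it --gpus all -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light-cuda \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512 \
  --n-gpu-layers 99
```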
ROCm support
An image variant optimized for ROCm is also offered, catering to users with AMD GPUs who wish to utilize hardware acceleration.
- Image variant: `-rocm`
- Supported architectures: `amd64`, `arm64`
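A corresponding sketch for AMD GPUs, assuming a `light-rocm` tag and that the host's ROCm devices are exposed to the container via `--device` (your user may also need to belong to the `video`/`render` groups):

```sh
# Pass the ROCm kernel driver and DRI devices into the container
docker run -it --device /dev/kfd --device /dev/dri \
  -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light-rocm \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512 \
  --n-gpu-layers 99
```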