Quick guide to run llama.cpp with Docker
Discover how to quickly set up and run llama.cpp models using Docker. This guide covers interactive mode, server deployment, and essential command options for seamless integration.
Quick Start
Running a model in interactive mode
To run a language model interactively using Docker, use the command below. It launches a Docker container with the specified image and mounts a local directory containing your models. Here's a breakdown of the key flags:
- `--hf-repo`: Indicates the Hugging Face repository to download the model from. This is useful when you want to use a model that's not already in your local directory.
- `--hf-file`: Specifies the exact file to download from the Hugging Face repository.
- `-m`: Specifies the model file to use.
- `-p`: Provides the prompt for text generation.
- `-n`: Sets the maximum number of tokens to generate.
```sh
docker run -it -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light \
  --hf-repo lmstudio-community/gemma-2-2b-it-GGUF \
  --hf-file gemma-2-2b-it-Q4_K_M.gguf \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512
```
In this example, the command will attempt to use the `gemma-2-2b.gguf` model from the local `/models` directory. If it's not found, it will download the `gemma-2-2b-it-Q4_K_M.gguf` file from the `lmstudio-community/gemma-2-2b-it-GGUF` Hugging Face repository. This feature allows for easy access to various models without manually downloading them beforehand.
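If the model file is already present in the mounted directory, the Hugging Face flags can be dropped entirely. A minimal local-only variant, assuming `/path/to/models/gemma-2-2b.gguf` already exists on the host:

```sh
# Run against a model that is already on disk; no Hugging Face download involved
docker run -it -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512
```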
Running a model as a server
For hosting the language model as a server, use the following command. This setup makes the model accessible over a network by mapping port `8000` of the container to port `8000` on your host machine. Once the server is running, you can send requests to it.
```sh
docker run -it -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server \
  -m /models/gemma-2-2b.gguf --port 8000 --host 0.0.0.0
```
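Once the server is up, you can test it with a simple HTTP request. A minimal example using `llama-server`'s `/completion` endpoint, where `prompt` is the input text and `n_predict` limits the number of generated tokens:

```sh
# Send a completion request to the containerized server on the mapped port
curl http://localhost:8000/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why is the sky blue?", "n_predict": 128}'
```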
Docker image repository
The Docker images for running llama.cpp models are hosted on the GitHub Container Registry (GHCR) under the repository `ghcr.io/ggerganov/llama.cpp`. For further details on usage, refer to the links below.
- https://github.com/ggerganov/llama.cpp/pkgs/container/llama.cpp
- https://github.com/ggerganov/llama.cpp/blob/master/docs/docker.md
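For example, a specific tag can be pulled ahead of time:

```sh
docker pull ghcr.io/ggerganov/llama.cpp:light
```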
Available Docker images
Several Docker images are available, each designed for specific use cases:
- `light`: Contains only `llama-cli`, suitable for lightweight interactive usage.
- `server`: Includes `llama-server`, ideal for running models as a server.
- `full`: A comprehensive image that encompasses `llama-cli`, `llama-server`, `llama-quantize`, and `convert_hf_to_gguf.py` (see the quantization example below).
Supported architectures include `amd64` and `arm64`.
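As an illustration of the `full` image, the sketch below quantizes a GGUF model with the bundled `llama-quantize` tool. It assumes the image's entrypoint accepts a `--quantize` subcommand as described in the upstream Docker documentation, and it uses placeholder file names; adjust the paths and quantization type for your model:

```sh
# Hypothetical file names; the quantized output is written into the mounted /models directory
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full \
  --quantize /models/gemma-2-2b-f16.gguf /models/gemma-2-2b-Q4_K_M.gguf Q4_K_M
```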
CUDA support
For users with NVIDIA GPUs seeking hardware acceleration, an image variant optimized for CUDA is available.
- Image variant: `-cuda`
- Supported architecture: `amd64`
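A sketch of GPU-accelerated interactive usage, assuming the suffix is appended to a base tag (e.g. `light-cuda`) and that the NVIDIA Container Toolkit is installed so Docker can pass GPUs through with `--gpus`:

```sh
# --n-gpu-layers controls how many model layers are offloaded to the GPU
docker run -it --gpus all -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light-cuda \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512 \
  --n-gpu-layers 99
```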
ROCm support
An image variant optimized for ROCm is also offered, catering to users with AMD GPUs who wish to utilize hardware acceleration.
- Image variant: `-rocm`
- Supported architectures: `amd64`, `arm64`
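A corresponding sketch for AMD GPUs, assuming a `light-rocm` tag and that the host's ROCm devices are exposed to the container via `--device` (your user may also need to belong to the `video`/`render` groups):

```sh
# Pass the ROCm kernel driver and DRI devices into the container
docker run -it --device /dev/kfd --device /dev/dri \
  -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light-rocm \
  -m /models/gemma-2-2b.gguf \
  -p "Why is the sky blue?" \
  -n 512 \
  --n-gpu-layers 99
```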