

Install llama-cpp-python with ease

Discover the ease of installing llama-cpp-python, a versatile Python package, and leveraging its multiple backends for optimized performance.

Introduction

llama-cpp-python is a Python package that provides bindings for the llama.cpp library, offering both low-level access to the C API and a high-level Python API for text completion. It also includes an OpenAI-compatible web server with support for local code completion, function calling, vision models, and serving multiple models.
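
For example, the high-level API performs text completion in a few lines. The sketch below follows the project's README; the model path is a placeholder, so point it at any local GGUF model:

```python
from llama_cpp import Llama

# Load a local GGUF model (placeholder path -- substitute your own).
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

# High-level text completion: the Llama object is callable.
output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,
    stop=["Q:", "\n"],  # stop generating at the next question or newline
    echo=True,          # include the prompt in the returned text
)
print(output["choices"][0]["text"])
```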

Quick start

To quickly install the llama-cpp-python package, you can use the following command:

```sh
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
```

This command installs a pre-built, CPU-only wheel of the package from the project's wheel index, so no local compilation is required. For improved performance on supported hardware, consider one of the backends below.
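
A minimal sanity check after installation is to import the package and print its version:

```python
import llama_cpp

# If the wheel installed correctly, the import succeeds and a version is reported.
print(llama_cpp.__version__)
```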

Integrating backends

Several common backends are supported, allowing the llama-cpp-python library to run on different hardware platforms and accelerate computation using the technology appropriate to each.

CUDA

A GPU-based backend that uses NVIDIA's CUDA architecture for accelerated computations, ideal for systems with NVIDIA graphics cards.

Use pre-built wheel:

```sh
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>  # cu121, cu122, cu123, cu124, cu125
```

Build locally:

```sh
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```
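
Once a CUDA-enabled build is installed, inference layers can be offloaded to the GPU through the n_gpu_layers parameter of the high-level API. A minimal sketch (the model path is again a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU; a smaller positive value
# splits the model between GPU and CPU when VRAM is limited.
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
)
```

The same parameter controls GPU offload for the Metal, hipBLAS, and Vulkan backends described below.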

Metal

A GPU-based backend that uses Apple's Metal API for macOS systems, providing accelerated computations on Apple devices.

Use pre-built wheel:

```sh
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
```

Build locally:

```sh
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

hipBLAS (ROCm)

A GPU-based backend that uses the hipBLAS library for AMD graphics cards, allowing for accelerated computations on AMD systems.

```sh
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python
```

Vulkan

A GPU-based backend that uses the Vulkan API for cross-platform, vendor-agnostic accelerated computations on a wide range of graphics cards.

```sh
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python
```

SYCL

A backend that uses the SYCL (pronounced "sickle") programming model for heterogeneous computing, allowing for accelerated computations on a variety of devices, including CPUs, FPGAs, and Intel integrated GPUs (iGPU).

```sh
# Activate the oneAPI environment first so the icx/icpx compilers are found
source /opt/intel/oneapi/setvars.sh
CMAKE_ARGS="-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python
```

Troubleshooting

If you encounter any issues during the installation process, here are some steps to help you resolve them:

  • Check the installation logs: If the installation fails, try adding the --verbose flag to the pip install command to see the full CMake build log. This can help you identify the source of the issue.
  • Verify dependencies: Make sure you have all the required dependencies installed, including the necessary C++ compiler and libraries.
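  • Inspect the compiled features: as a rough check (this uses the package's low-level bindings), you can print the feature flags that the bundled llama.cpp was compiled with, which reveals whether your chosen backend actually made it into the build, as shown below.

```python
import llama_cpp

# llama_print_system_info() returns a bytes summary of the compile-time
# features (CUDA, Metal, AVX, ...) baked into the underlying llama.cpp build.
print(llama_cpp.llama_print_system_info().decode("utf-8"))
```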