OCI and Ampere optimize llama.cpp for enhanced performance
Explore the collaboration between OCI and Ampere that optimizes llama.cpp, leveraging CPU advancements to achieve faster and more efficient AI-driven computing.
Introduction
The collaboration between Oracle Cloud Infrastructure (OCI) and Ampere Computing has led to significant advancements in CPU-based language model inference. By optimizing llama.cpp for Ampere Altra ARM-based CPUs, they have achieved exceptional performance improvements, with Ampere's optimized version showing more than a 30% boost. This optimization, combined with the wide availability of Ampere shapes across OCI regions, ensures global accessibility and scalability for users.
Quantization advancements
Ampere has further enhanced the efficiency of language model inference on its hardware by introducing two new quantization methods in the optimized llama.cpp build:
- `Q4_K_4`
- `Q8R16`
These innovative quantization techniques offer significant benefits:
- They maintain model sizes and perplexity similar to their existing counterparts (`Q4_K` and `Q8_0`, respectively).
- They deliver up to 1.5-2x faster inference performance compared to the previous methods.
By implementing these new quantization methods, Ampere aims to substantially improve the speed and efficiency of language model inference on its hardware. This advancement not only enhances the performance of AI applications but also contributes to more efficient resource utilization in AI-driven computing environments.
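If you would rather quantize a model yourself instead of downloading one of Ampere's pre-quantized GGUF files (the approach used in the setup below), the usual llama.cpp quantization workflow applies. The following is only a sketch: it assumes the Ampere image ships the standard `llama-quantize` tool under `/llm/` and accepts `Q8R16` as a quantization type name, and the file names are placeholders; check the image documentation for the exact tool path and supported type strings.

```sh
# Hypothetical sketch: convert an FP16 GGUF into the Q8R16 format inside the
# Ampere llama.cpp container. Assumes the image provides the standard
# llama-quantize binary at /llm/llama-quantize and accepts Q8R16 as a type
# name; the input/output file names below are placeholders.
docker run --entrypoint /llm/llama-quantize \
  -v /path/to/models:/models \
  amperecomputingai/llama.cpp:latest \
  /models/Llama-3.2-1B-Instruct-F16.gguf \
  /models/Llama-3.2-1B-Instruct-Q8R16.gguf \
  Q8R16
```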
Setting up the environment
To get started with llama.cpp on Ampere hardware, use the following Docker command to run a container with the necessary environment:
```sh
docker run --privileged=true --entrypoint /bin/bash -v /path/to/models:/models -it amperecomputingai/llama.cpp:latest
```
This command runs a Docker container with privileged access, mounts a local directory for models, and opens an interactive bash session.
Next, download the Llama 3.2 1B Instruct model using the Hugging Face CLI:
```bash
huggingface-cli download AmpereComputing/llama-3.2-1b-instruct-gguf Llama-3.2-1B-Instruct-Q8R16.gguf --local-dir /models
```
For more information on available Docker images, visit: https://hub.docker.com/r/amperecomputingai/llama.cpp
Running llama.cpp
Starting the llama.cpp server
To start the llama.cpp server, use the following Docker command with the `llama-server` entrypoint:
```sh
docker run -v /path/to/models:/models amperecomputingai/llama.cpp:latest -m /models/Llama-3.2-1B-Instruct-Q8R16.gguf
```
This command starts the llama.cpp server using the specified model file.
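Once the server is up, you can send it requests over HTTP. The example below is a sketch that assumes the server listens on llama.cpp's default port 8080 and that the port is published from the container (for instance by adding `-p 8080:8080` to the `docker run` command above); it uses the OpenAI-compatible chat completions endpoint provided by the llama.cpp server.

```sh
# Query the running llama-server via its OpenAI-compatible chat endpoint.
# Assumes the default listen port 8080 is published from the container
# (e.g. add -p 8080:8080 to the docker run command above).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give me three tips for learning to rollerskate."}
        ],
        "max_tokens": 128
      }'
```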
Running inference
To run inference using the llama.cpp CLI, use the following command:
```sh
docker run --entrypoint /llm/llama-cli -v /path/to/models:/models -it amperecomputingai/llama.cpp:latest -m /models/Llama-3.2-1B-Instruct-Q8R16.gguf -p "10 tips on how to get good at rollerskating:\n"
```
This command runs the llama.cpp CLI with a specific prompt to generate text based on the given input.
Using Ollama
Ollama is a powerful tool for running and managing large language models locally. Ampere Computing has optimized Ollama specifically for its ARM-based CPUs, offering enhanced performance for AI workloads.
The `ghcr.io/amperecomputingai/ollama-ampere` Docker image is a specialized version of Ollama that has been optimized for Ampere's ARM-based processors. This image is available in two variants:
- `ol9`: Based on Oracle Linux 9
- `ub22`: Based on Ubuntu 22.04 LTS
To start an Ollama server using the Ampere-optimized image, use the following command:
```sh
docker run -d -v ollama:/root/.ollama -p 11434:11434 ghcr.io/amperecomputingai/ollama-ampere:0.0.6-ub22
```
This command does the following:
- Runs the container in detached mode (`-d`)
- Mounts a Docker volume named `ollama` to `/root/.ollama` in the container, ensuring persistence of models and configurations
- Maps port `11434` from the container to the host, allowing access to the Ollama API
After starting the Ollama server, you can interact with it using the Ollama CLI or API, just as you would with the standard Ollama installation. However, you'll benefit from the performance optimizations specific to Ampere hardware.
For example, to run a model:
```sh
docker exec -it <container_id> ollama run gemma2:2b
```
Replace `<container_id>` with the actual ID of your running Ollama container.
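Because port `11434` is mapped to the host, you can also call the standard Ollama HTTP API directly. The example below assumes the model has already been pulled (for instance by the `ollama run` command above):

```sh
# Generate a completion through the Ollama HTTP API on the mapped port 11434.
# Assumes the gemma2:2b model has already been pulled into the container.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:2b",
  "prompt": "10 tips on how to get good at rollerskating:",
  "stream": false
}'
```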
For more information on available Ollama images, visit: https://github.com/orgs/AmpereComputingAI/packages/container/package/ollama-ampere
By using the Ampere-optimized Ollama image, you can leverage the full potential of Ampere's ARM-based CPUs for running large language models, achieving better performance and efficiency in your AI applications.