OCI and Ampere optimize llama.cpp for enhanced performance

Explore the collaboration between OCI and Ampere that optimizes llama.cpp, leveraging CPU advancements to achieve faster and more efficient AI-driven computing.

Introduction

The collaboration between Oracle Cloud Infrastructure (OCI) and Ampere Computing has led to significant advancements in CPU-based language model inference. By optimizing llama.cpp for Ampere Altra ARM-based CPUs, the two companies have achieved substantial performance gains, with Ampere's optimized build delivering a speedup of more than 30%. Combined with the wide availability of Ampere shapes across OCI regions, this optimization gives users global accessibility and scalability.

Quantization advancements

Ampere has further enhanced the efficiency of language model inference on its hardware by introducing two new quantization methods in the optimized llama.cpp build:

  • Q4_K_4
  • Q8R16

These innovative quantization techniques offer significant benefits:

  1. They maintain model sizes and perplexity similar to their existing counterparts (Q4_K and Q8_0, respectively).
  2. They deliver up to 1.5-2x faster inference than their existing counterparts.

By implementing these new quantization methods, Ampere aims to substantially improve the speed and efficiency of language model inference on its hardware. This advancement not only enhances the performance of AI applications but also contributes to more efficient resource utilization in AI-driven computing environments.
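
Pre-quantized GGUF files in these formats are published by Ampere on Hugging Face (one is downloaded in the setup steps below). If you want to produce them yourself, the sketch below shows how that could look with llama.cpp's standard llama-quantize tool; note that the tool's availability inside the Ampere image, its location on the PATH, and the exact spelling of the Q4_K_4 and Q8R16 type names are assumptions rather than details confirmed here.

```sh
# Illustrative sketch only: assumes the Ampere llama.cpp image ships the standard
# llama-quantize tool on the PATH and accepts the Ampere-specific type names below.
docker run --entrypoint /bin/bash -v /path/to/models:/models -it amperecomputingai/llama.cpp:latest

# Inside the container, convert an FP16 GGUF into the new formats:
llama-quantize /models/model-f16.gguf /models/model-q8r16.gguf Q8R16
llama-quantize /models/model-f16.gguf /models/model-q4_k_4.gguf Q4_K_4
```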

Setting up the environment

To get started with llama.cpp on Ampere hardware, use the following Docker command to run a container with the necessary environment:

```sh
docker run --privileged=true --entrypoint /bin/bash -v /path/to/models:/models -it amperecomputingai/llama.cpp:latest
```

This command runs a Docker container with privileged access, mounts a local directory for models, and opens an interactive bash session.

Next, download the Llama 3.2 1B Instruct model using the Hugging Face CLI:

```bash
huggingface-cli download AmpereComputing/llama-3.2-1b-instruct-gguf Llama-3.2-1B-Instruct-Q8R16.gguf --local-dir /models
```

For more information on available Docker images, visit: https://hub.docker.com/r/amperecomputingai/llama.cpp

Running llama.cpp

Starting the llama.cpp server

To start the llama.cpp server, use the following Docker command with the llama-server entrypoint:

```sh
docker run -v /path/to/models:/models amperecomputingai/llama.cpp:latest -m /models/Llama-3.2-1B-Instruct-Q8R16.gguf
```

This command starts the llama.cpp server using the specified model file.
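
Once the server is up, you can query it over llama-server's built-in HTTP API, which includes an OpenAI-compatible chat endpoint. The example below assumes the server is listening on its default port 8080 and that the port has been published to the host (for example by adding -p 8080:8080 to the command above); adjust the address to match your setup.

```sh
# Send a chat completion request to the running llama-server instance.
# Assumes port 8080 (llama-server's default) is published to the host.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Give me three tips for getting good at rollerskating."}
        ],
        "max_tokens": 128
      }'
```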

Running inference

To run inference using the llama.cpp CLI, use the following command:

```sh
docker run --entrypoint /llm/llama-cli -v /path/to/models:/models -it amperecomputingai/llama.cpp:latest -m /models/Llama-3.2-1B-Instruct-Q8R16.gguf -p "10 tips on how to get good at rollerskating:\n"
```

This command runs the llama.cpp CLI with a specific prompt to generate text based on the given input.
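
The CLI also accepts llama.cpp's usual generation flags, so you can control how much text is generated, the context size, and the number of CPU threads used. The values below are only examples; tune the thread count to the cores available on your shape.

```sh
# Same invocation with a few common llama.cpp flags:
#   -n  number of tokens to generate
#   -c  context size in tokens
#   -t  number of CPU threads to use
docker run --entrypoint /llm/llama-cli -v /path/to/models:/models -it amperecomputingai/llama.cpp:latest \
  -m /models/Llama-3.2-1B-Instruct-Q8R16.gguf \
  -p "10 tips on how to get good at rollerskating:\n" \
  -n 256 -c 2048 -t 32
```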

Using Ollama

Ollama is a powerful tool for running and managing large language models locally. Ampere Computing has optimized Ollama specifically for its ARM-based CPUs, offering enhanced performance for AI workloads.

The ghcr.io/amperecomputingai/ollama-ampere Docker image is a specialized version of Ollama that's been optimized for Ampere's ARM-based processors. This image is available in two variants:

  • -ol9: Based on Oracle Linux 9
  • -ub22: Based on Ubuntu 22.04 LTS

To start an Ollama server using the Ampere-optimized image, use the following command:

```sh
docker run -d -v ollama:/root/.ollama -p 11434:11434 ghcr.io/amperecomputingai/ollama-ampere:0.0.6-ub22
```

This command does the following:

  • Runs the container in detached mode (-d)
  • Mounts a Docker volume named ollama to /root/.ollama in the container, ensuring persistence of models and configurations
  • Maps port 11434 from the container to the host, allowing access to the Ollama API

After starting the Ollama server, you can interact with it using the Ollama CLI or API, just as you would with the standard Ollama installation. However, you'll benefit from the performance optimizations specific to Ampere hardware.

For example, to run a model:

```sh
docker exec -it <container_id> ollama run gemma2:2b
```

Replace <container_id> with the actual ID of your running Ollama container.
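
You can also skip the CLI and call the standard Ollama REST API directly on the mapped port, for example to generate a completion with the model pulled above:

```sh
# Generate a completion through the Ollama API exposed on port 11434.
curl -s http://localhost:11434/api/generate \
  -d '{
        "model": "gemma2:2b",
        "prompt": "Give me three tips for getting good at rollerskating.",
        "stream": false
      }'
```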

For more information on available Ollama images, visit: https://github.com/orgs/AmpereComputingAI/packages/container/package/ollama-ampere

By using the Ampere-optimized Ollama image, you can leverage the full potential of Ampere's ARM-based CPUs for running large language models, achieving better performance and efficiency in your AI applications.