llama.cpp

The core goal of llama.cpp is to enable high-performance large language model (LLM) inference across a wide range of hardware, from local devices to the cloud, with minimal configuration.

What is llama.cpp?

llama.cpp is a high-performance LLM inference framework written in pure C/C++. It avoids heavy external dependencies and runs efficiently on both CPUs and GPUs. Combined with the GGUF model format and quantization, it lets models that would otherwise be too large run on consumer devices such as PCs, Macs, and even phones.

This document walks through building llama.cpp, converting and quantizing an example model, and then running and benchmarking it. All commands are run on the device.

Clone the repository

git clone https://github.com/ggml-org/llama.cpp.git

Build llama.cpp

Install build tools

sudo apt install cmake gcc g++ libcurl4-openssl-dev
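Before building, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, not part of llama.cpp, and the tool list is whatever you pass in.

```shell
# Print every tool from the argument list that is not found on PATH,
# one per line. Prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the build tools installed above, plus git for cloning.
# missing_tools cmake gcc g++ git
```

An empty result means every listed tool was found.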

Build

cmake -B build
cmake --build build --config Release -j$(nproc)
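Once the build finishes, the binaries used later in this guide should be in build/bin. A small sketch for checking that, assuming the default output directory; `check_binaries` is a hypothetical helper name.

```shell
# Report which of the binaries used in this guide are missing from a
# build output directory (default: build/bin, relative to the current dir).
check_binaries() {
    bin_dir="${1:-build/bin}"
    for name in llama-cli llama-quantize llama-bench; do
        [ -x "$bin_dir/$name" ] || printf 'missing: %s\n' "$name"
    done
}

# Example, run from the llama.cpp directory:
# check_binaries build/bin
```

No output means all three binaries exist and are executable.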

ARMv9

On ARMv9 devices such as the Radxa Orion O6 / O6N, you can enable the armv9-a architecture flags and the KleidiAI build option for hardware-level optimization.

Check out commit 4aced7a before building:

git checkout 4aced7a
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j$(nproc)

Hardware optimization

llama.cpp integrates the Arm KleidiAI library, which provides highly optimized matrix-multiplication kernels for hardware features such as SME, I8MM, and dot-product acceleration. You can enable this feature with the build option GGML_CPU_KLEIDIAI=ON.

cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j$(nproc)
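To see whether the CPU exposes the features mentioned above, you can inspect the Features line of /proc/cpuinfo. This is a sketch under the assumption of a Linux arm64 kernel, where dot-product, I8MM, and SME are usually reported as `asimddp`, `i8mm`, and `sme`; `has_cpu_feature` is a hypothetical helper name.

```shell
# Succeed if the given flag appears as a whole word on a "Features" line.
# The cpuinfo path is a parameter so the check can run on a saved copy.
has_cpu_feature() {
    feature="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    grep -i '^Features' "$cpuinfo" | tr ' \t' '\n\n' | grep -x "$feature" >/dev/null
}

# Example:
# for f in asimddp i8mm sme; do
#     if has_cpu_feature "$f"; then echo "$f: yes"; else echo "$f: no"; fi
# done
```

Matching whole words avoids false positives such as `asimd` matching inside `asimddp`.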

Quick Start

Tip: Python 3.11 or later is recommended.
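A quick way to compare the installed Python version against the recommended one is sketched below. `version_ge` is a hypothetical helper, and it relies on `sort -V` (version sort) from GNU coreutils.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# The smaller of the two sorts first, so $2 must come out on top.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example:
# py_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
# version_ge "$py_ver" 3.11 || echo "Python $py_ver is older than 3.11"
```

A plain string comparison would rank 3.9 above 3.11, which is why version sort is used here.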

In the following steps, the example model is DeepSeek-R1-Distill-Qwen-1.5B.

Download the example model

Make sure Git LFS is installed.

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
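If Git LFS was not active during the clone, the weight files are small text pointers instead of real data, and conversion will fail later. The sketch below detects such pointer files by their first line; `is_lfs_pointer` is a hypothetical helper name.

```shell
# Succeed if the file is a Git LFS pointer stub rather than real content.
# Pointer files start with a fixed 42-byte "version <spec URL>" line.
is_lfs_pointer() {
    head -c 42 "$1" | tr -d '\000' |
        grep -Fx 'version https://git-lfs.github.com/spec/v1' >/dev/null
}

# Example:
# for f in DeepSeek-R1-Distill-Qwen-1.5B/*.safetensors; do
#     is_lfs_pointer "$f" && echo "$f is an LFS pointer; run 'git lfs pull' in the model directory"
# done
```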

Convert the model

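cd llama.cpp
pip3 install -r ./requirements.txt
# Adjust the model path if the model was cloned outside the llama.cpp directory.
# --outfile writes the F16 GGUF into build/bin, where the next step expects it.
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B/ --outtype f16 --outfile build/bin/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf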
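A converted file can be sanity-checked by looking at its first four bytes, which in a GGUF file are the ASCII magic `GGUF`. This sketch only checks the magic and does not validate the rest of the file; `is_gguf` is a hypothetical helper name.

```shell
# Succeed if the file begins with the 4-byte GGUF magic.
is_gguf() {
    [ "$(head -c 4 "$1" | tr -d '\000')" = "GGUF" ]
}

# Example:
# is_gguf DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf && echo "looks like a GGUF file"
```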

Quantize the model

cd build/bin
./llama-quantize DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
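To produce several quantized variants from the same F16 file, the command above can be generated per type. The sketch below only prints the commands so they can be reviewed first; `quantize_cmds` is a hypothetical helper, and it assumes the input name ends in `-F16.gguf` as in this guide.

```shell
# Print one llama-quantize command per requested type, without running it.
# The output name replaces the "-F16.gguf" suffix with "-<TYPE>.gguf".
quantize_cmds() {
    src="$1"; shift
    for qtype in "$@"; do
        printf './llama-quantize %s %s %s\n' "$src" "${src%-F16.gguf}-$qtype.gguf" "$qtype"
    done
}

# Example: review the commands, then pipe them to sh to run them.
# quantize_cmds DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf Q4_K_M Q8_0
```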

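Other quantization types accepted by llama-quantize are listed below. Sizes and perplexity (ppl) changes are the tool's reference figures for the models named, not for the example model; bpw means bits per weight.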

or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
or  IQ2_XXS :  2.06 bpw quantization
or  IQ2_XS  :  2.31 bpw quantization
or  IQ2_S   :  2.5  bpw quantization
or  IQ2_M   :  2.7  bpw quantization
or  IQ1_S   :  1.56 bpw quantization
or  IQ1_M   :  1.75 bpw quantization
or  TQ1_0   :  1.69 bpw ternarization
or  TQ2_0   :  2.06 bpw ternarization
or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
or  IQ3_XXS :  3.06 bpw quantization
or  IQ3_S   :  3.44 bpw quantization
or  IQ3_M   :  3.66 bpw quantization mix
or  Q3_K    : alias for Q3_K_M
or  IQ3_XS  :  3.3 bpw quantization
or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
or  IQ4_NL  :  4.50 bpw non-linear quantization
or  IQ4_XS  :  4.25 bpw non-linear quantization
or  Q4_K    : alias for Q4_K_M
or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
or  Q5_K    : alias for Q5_K_M
or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
or  F32     : 26.00G              @ 7B
    COPY    : only copy tensors, no quantizing
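The bpw figures in the list are nominal. The effective bits per weight of a finished file can be estimated as file size in bits divided by parameter count. The sketch below does this arithmetic with awk, using the 1.04 GiB size and 1.78 B parameter count that llama-bench reports for the example model in the benchmark section; `effective_bpw` is a hypothetical helper name.

```shell
# Effective bits per weight = file size in bits / number of parameters.
# $1: file size in GiB, $2: parameter count in billions.
effective_bpw() {
    awk -v size_gib="$1" -v params_b="$2" \
        'BEGIN { printf "%.2f\n", size_gib * 1024 * 1024 * 1024 * 8 / (params_b * 1e9) }'
}

# Example with the values from the benchmark table:
# effective_bpw 1.04 1.78   # prints 5.02
```

The result depends on the model: some tensors are typically stored at higher precision than the nominal type, so small models can end up with a noticeably higher effective value.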

Validate the model

# Run from the llama.cpp directory; skip the cd if you are already in build/bin.
cd build/bin
./llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

Example output:

> hi, who are you
<think>

</think>

Hi! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
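The example output above includes a `<think>` block, which this model emits before its answer. If you capture the output to a file and only want the final answer, a filter like the following can drop that block. It is a sketch that assumes the opening and closing tags each sit on their own line, as in the example; `strip_think` is a hypothetical helper name.

```shell
# Read text on stdin and delete every line from <think> through </think>.
strip_think() {
    sed '/<think>/,/<\/think>/d'
}

# Example:
# strip_think < captured_output.txt
```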

Benchmark the model

./llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

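Example output with 8 threads (-t 8). pp512 is prompt processing over 512 tokens, tg128 is generation of 128 tokens, and t/s is tokens per second:

radxa@orion-o6:~/llama.cpp/build/bin$ ./llama-bench -m ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -t 8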
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1.5B Q4_K - Medium       |   1.04 GiB |     1.78 B | CPU        |       8 |         pp512 |         64.60 ± 0.27 |
| qwen2 1.5B Q4_K - Medium       |   1.04 GiB |     1.78 B | CPU        |       8 |         tg128 |         36.29 ± 0.16 |
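To compare runs, it is handy to reduce the table above to one line per test. The following sketch reads the markdown table on stdin, locates the `test` and `t/s` columns by their header names, and prints the test name with the mean tokens-per-second value. `bench_tps` is a hypothetical helper name, and the parsing assumes the table layout shown above.

```shell
# Read a llama-bench markdown table on stdin and print "<test> <t/s>" per row.
# Columns are found by header name; the "± stddev" part is dropped.
bench_tps() {
    awk -F'|' '
        function trim(s) { gsub(/^ +| +$/, "", s); return s }
        !hdr {
            # Keep scanning until a line with both header cells is seen.
            for (i = 1; i <= NF; i++) {
                h = trim($i)
                if (h == "test") tc = i
                if (h == "t/s") sc = i
            }
            if (tc && sc) hdr = 1
            next
        }
        $2 ~ /^ *-+:? *$/ { next }   # separator row under the header
        NF >= sc { split(trim($sc), v, " "); print trim($tc), v[1] }'
}

# Example:
# ./llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -t 8 | bench_tps
```

For the table above this yields one line for pp512 and one for tg128.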

References

For more details about llama.cpp, refer to the official documentation.
