Llama.cpp

The primary goal of llama.cpp is to enable LLM inference on various hardware (both local and cloud) with minimal setup and state-of-the-art performance.

Clone the Repository

git clone https://github.com/ggml-org/llama.cpp.git

Compile llama.cpp

Install Build Tools

sudo apt install cmake gcc g++

Build the Project

Run the following commands from the root of the cloned llama.cpp directory:

cmake -B build
cmake --build build --config Release

tip

If you are using the Radxa Orion O6, which has an Armv9 CPU, you can add the -march=armv9-a compiler flag to enable optimizations for that architecture:

cmake -B build -DCMAKE_CXX_FLAGS="-march=armv9-a" -DCMAKE_C_FLAGS="-march=armv9-a"
cmake --build build --config Release

tip

llama.cpp integrates Arm's KleidiAI library, which provides optimized matrix multiplication kernels that use hardware features such as SME, I8MM, and dot-product instructions. You can enable it with the GGML_CPU_KLEIDIAI build option:

cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
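
Whether these kernels help depends on which of those features the CPU actually exposes. As a rough check, the sketch below parses the Features line of /proc/cpuinfo on Linux. The flag names (asimddp for dot product, i8mm, sme) are assumed to be the names the Linux kernel reports on AArch64, and arm_features is a hypothetical helper for illustration, not part of llama.cpp.

```python
from pathlib import Path

# Flags of interest, keyed by the name Linux is assumed to report in /proc/cpuinfo.
FLAGS = {"asimddp": "dot product", "i8mm": "int8 matrix multiply", "sme": "SME"}


def arm_features(cpuinfo_text: str) -> dict[str, bool]:
    """Return which of the FLAGS appear in the first 'Features' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            present = set(value.split())
            return {flag: flag in present for flag in FLAGS}
    # No 'Features' line (for example on a non-Arm machine): report nothing found.
    return {flag: False for flag in FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for flag, found in arm_features(text).items():
        print(f"{FLAGS[flag]:22s} ({flag}): {'yes' if found else 'no'}")
```

If a feature is reported as missing, enabling the build option should still be harmless, but the corresponding kernels are unlikely to be used.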

Usage

GGUF Model Conversion

tip

Here, we take DeepSeek-R1-Distill-Qwen-1.5B as an example.

Download the Hugging Face Model

Use Git LFS to clone the model repository:

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

Generate GGUF Model

tip

It is recommended to use Python 3.11 or later.

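cd llama.cpp
pip3 install -r ./requirements.txt
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B/

The model path above is relative to the llama.cpp directory; adjust it if you cloned the model elsewhere. By default, the script writes the resulting F16 GGUF file into the model directory.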
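
To confirm that the conversion produced a well-formed file, you can inspect the start of the GGUF file. The sketch below assumes the header layout used by GGUF versions 2 and 3 (the 4-byte magic GGUF, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count); read_gguf_header is a hypothetical helper, not a llama.cpp API.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict[str, int]:
    """Parse the fixed-size GGUF header: magic, version, tensor and KV counts."""
    if len(data) < 24:
        raise ValueError("too short to hold a GGUF header")
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

To use it, open the generated .gguf file in binary mode, read the first 24 bytes, and pass them to read_gguf_header. A ValueError suggests the conversion did not complete or the wrong file was selected.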

Quantize the Model

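The commands below assume the F16 GGUF file has been copied or moved into build/bin; otherwise, pass its full path as the first argument.

cd build/bin
./llama-quantize DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M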
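
If you want to compare several quantization types, the same command can be generated in a loop. The following is a minimal sketch that only builds the argument lists, using the input, output, type argument order shown above; quantize_commands and its output naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def quantize_commands(f16_model: str, types: list[str], tool: str = "./llama-quantize") -> list[list[str]]:
    """Build one llama-quantize argv per type: <tool> <input> <output> <type>."""
    src = Path(f16_model)
    # Derive the base name by dropping the extension and the -F16 suffix.
    stem = src.name.removesuffix(".gguf").removesuffix("-F16")
    return [[tool, str(src), str(src.with_name(f"{stem}-{qtype}.gguf")), qtype] for qtype in types]


for argv in quantize_commands("DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf", ["Q4_K_M", "Q8_0"]):
    print(" ".join(argv))  # pass argv to subprocess.run(argv, check=True) to execute
```

Printing the commands first makes it easy to review them before running anything that writes large files.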

Available quantization types. Where a line gives a size and a perplexity (ppl) change, both refer to the reference model named on that line; bpw means bits per weight:

Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
IQ2_XXS :  2.06 bpw quantization
IQ2_XS  :  2.31 bpw quantization
IQ2_S   :  2.5  bpw quantization
IQ2_M   :  2.7  bpw quantization
IQ1_S   :  1.56 bpw quantization
IQ1_M   :  1.75 bpw quantization
TQ1_0   :  1.69 bpw ternarization
TQ2_0   :  2.06 bpw ternarization
Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
IQ3_XXS :  3.06 bpw quantization
IQ3_S   :  3.44 bpw quantization
IQ3_M   :  3.66 bpw quantization mix
Q3_K    : alias for Q3_K_M
IQ3_XS  :  3.3 bpw quantization
Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
IQ4_NL  :  4.50 bpw non-linear quantization
IQ4_XS  :  4.25 bpw non-linear quantization
Q4_K    : alias for Q4_K_M
Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
Q5_K    : alias for Q5_K_M
Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
F16     : 14.00G, +0.0020 ppl @ Mistral-7B
BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
F32     : 26.00G              @ 7B
COPY    : only copy tensors, no quantizing
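
The bpw figures can be turned into a rough file-size estimate for a model of your own: size is approximately the parameter count times bits per weight, divided by 8. The sketch below is a back-of-the-envelope calculation that ignores file metadata and the fact that some tensors may be stored at a different precision; the function names are illustrative, and 1.78 B is the parameter count llama-bench reports for the example model.

```python
GIB = 1024 ** 3


def estimate_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size: parameters x bits per weight, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def effective_bpw(size_gib: float, n_params: float) -> float:
    """Invert the estimate: average bits per weight implied by a file size."""
    return size_gib * GIB * 8 / n_params


# Example figures: a 1.78 B parameter model at two of the listed bpw values.
for name, bpw in [("IQ4_XS", 4.25), ("IQ2_XXS", 2.06)]:
    print(f"{name}: ~{estimate_size_gib(1.78e9, bpw):.2f} GiB")
```

Real files usually come out somewhat larger than this estimate, so treat the result as a lower bound when checking whether a model fits in memory.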

Run GGUF Model

cd build/bin
./llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

> hi, who are you
<think>

</think>

Hi! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
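
As the session above shows, this model wraps its reasoning in a <think> ... </think> block before the answer. If you capture the output in a script, you may want to separate the two parts. The following is a minimal sketch using a regular expression; split_reasoning is a hypothetical helper and assumes at most one such block per response.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the matched block.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For the response shown above, the reasoning part is empty and the answer is the greeting text.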

GGUF Benchmark Test

./llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

Example output on the Radxa Orion O6 with 8 threads (-t 8):

radxa@orion-o6:~/llama.cpp/build/bin$ ./llama-bench -m ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -t 8
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1.5B Q4_K - Medium       |   1.04 GiB |     1.78 B | CPU        |       8 |         pp512 |         64.60 ± 0.27 |
| qwen2 1.5B Q4_K - Medium       |   1.04 GiB |     1.78 B | CPU        |       8 |         tg128 |         36.29 ± 0.16 |
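
In this table, pp512 is prompt processing over 512 tokens and tg128 is generation of 128 tokens, both in tokens per second. To compare runs in a script, the table can be parsed directly. The sketch below is one simple way to do that for the Markdown layout shown above; parse_bench_table and tokens_per_second are illustrative names, not part of llama.cpp.

```python
def parse_bench_table(markdown: str) -> list[dict[str, str]]:
    """Parse a llama-bench Markdown table into one dict per result row."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    if len(rows) < 3:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


def tokens_per_second(cell: str) -> float:
    """Extract the mean from a 't/s' cell such as '36.29 ± 0.16'."""
    return float(cell.split("±")[0])
```

llama-bench can also write machine-readable formats through its -o option (for example -o json), which is likely more robust than parsing the table if you control how the benchmark is run.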

References

For more details on llama.cpp, refer to the official documentation in the project repository: https://github.com/ggml-org/llama.cpp
