llama.cpp

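The core goal of llama.cpp is to deliver high-performance large language model (LLM) inference with minimal setup on a wide range of hardware, from local machines to the cloud.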

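What is llama.cpp?

llama.cpp is a high-performance LLM inference framework implemented in plain C/C++. It avoids heavy external library dependencies and runs efficiently on both CPUs and GPUs. With its GGUF format and quantization support, large models that would otherwise be too heavy can run smoothly on consumer devices such as ordinary PCs, Macs, and even phones.

This document walks you through getting started with llama.cpp, from setting up the environment to running a model.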

Clone the repository

Device

git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp

Build llama.cpp

Install the build tools

Device

sudo apt install cmake gcc g++ libcurl4-openssl-dev
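
Before starting the build, it can be useful to confirm that the tools are actually on the PATH. The following is a minimal sketch, not part of llama.cpp: `missing_tools` is a hypothetical helper name, and the tool list is inferred from the packages installed above plus `git` for the clone step.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names assumed from the apt packages above (git added for the clone step).
missing=$(missing_tools git cmake gcc g++)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi
```

If anything is reported missing, rerun the install command above before configuring with CMake.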

Build

Device

cmake -B build
cmake --build build --config Release -j$(nproc)

Armv9

On Radxa Orion O6 / O6N devices, which use the Armv9 architecture, you can enable the armv9-a and KleidiAI build options for hardware-level optimization.

Use commit 4aced7a.

Device

git checkout 4aced7a
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j$(nproc)

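Hardware optimization

llama.cpp integrates the Arm KleidiAI library, which provides optimized matrix multiplication kernels for hardware features such as SME, I8MM, and dot-product acceleration. Enable it with the build option GGML_CPU_KLEIDIAI=ON.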

Device

cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j$(nproc)
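
The build options above only pay off if the CPU reports the corresponding features. As a rough sketch (assuming a Linux system that exposes a `Features` line in `/proc/cpuinfo`, as Arm kernels do), the check below looks for the relevant flags. `has_cpu_feature` is a hypothetical helper, and Linux lists the dot-product extension under the name `asimddp`.

```shell
# Succeeds if the given feature flag appears on a "Features" line of a
# cpuinfo-style file (defaults to /proc/cpuinfo on Linux).
has_cpu_feature() {
    feature=$1
    cpuinfo=${2:-/proc/cpuinfo}
    grep -i '^Features' "$cpuinfo" 2>/dev/null | grep -qw -- "$feature"
}

# Linux reports the Arm dot-product extension as "asimddp".
for f in asimddp i8mm sme; do
    if has_cpu_feature "$f"; then
        echo "$f: available"
    else
        echo "$f: not reported"
    fi
done
```

A feature that is not reported should be left out of the `-DGGML_CPU_ARM_ARCH` string, since binaries built for it may fail with an illegal-instruction error on that CPU.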

Quick start

Tip

Python 3.11 or later is recommended.
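
To check the installed interpreter against this recommendation, a small version comparison can help. This is a sketch that assumes `sort -V` (version sort, provided by GNU coreutils) is available; `version_ge` is a hypothetical helper name.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import platform; print(platform.python_version())')
    if version_ge "$py_version" 3.11; then
        echo "Python $py_version is recent enough."
    else
        echo "Python $py_version is older than the recommended 3.11."
    fi
fi
```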

The following steps use DeepSeek-R1-Distill-Qwen-1.5B as the example model.

Download the example model

Use Git LFS to download the model.

Device

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
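
If Git LFS is not set up correctly, the clone can succeed while the weight files are only small text pointers, and the later conversion step will then fail. The sketch below detects that case by looking for the LFS pointer header. `is_lfs_pointer` is a hypothetical helper, and the `*.safetensors` pattern is an assumption about how the weights are stored in this repository.

```shell
# Succeeds if the file is a Git LFS pointer stub rather than real content.
# Pointer files are small text files whose first line names the LFS spec.
is_lfs_pointer() {
    head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

model_dir=DeepSeek-R1-Distill-Qwen-1.5B
for f in "$model_dir"/*.safetensors; do
    [ -f "$f" ] || continue
    if is_lfs_pointer "$f"; then
        echo "$f is still an LFS pointer; run 'git lfs pull' inside $model_dir"
    else
        echo "$f looks fully downloaded"
    fi
done
```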

Convert the model

Device

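# Run from the llama.cpp repository root, where the model was cloned in the previous step
pip3 install -r ./requirements.txt
# Write the F16 GGUF into build/bin so the quantization step can find it
python3 convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-1.5B/ --outtype f16 --outfile build/bin/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf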
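
A quick sanity check after conversion is to confirm that the output really is a GGUF file: GGUF files begin with the four ASCII bytes `GGUF`. The sketch below checks any `.gguf` files in the current directory; `is_gguf` is a hypothetical helper, and the check covers only the header, not the tensor data.

```shell
# Succeeds if the file starts with the 4-byte GGUF magic, "GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Check every .gguf file in the current directory, if there are any.
for f in ./*.gguf; do
    [ -f "$f" ] || continue
    if is_gguf "$f"; then
        echo "$f: GGUF header found"
    else
        echo "$f: not a GGUF file (conversion may have been interrupted)"
    fi
done
```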

Quantize the model

Device

cd build/bin
./llama-quantize DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M
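
To see how much the quantization step saved, the two files can be compared by size. This is a minimal sketch with a hypothetical helper, `size_percent`; it assumes both files from the command above are in the current directory.

```shell
# Print the size of file $2 as a percentage of file $1 (integer, rounded down).
size_percent() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    orig=$(($(wc -c < "$1")))
    new=$(($(wc -c < "$2")))
    [ "$orig" -gt 0 ] || return 1
    echo $(( new * 100 / orig ))
}

f16=DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf
q4=DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
if pct=$(size_percent "$f16" "$q4"); then
    echo "Quantized file is ${pct}% of the F16 size"
else
    echo "Could not compare sizes (are both files in the current directory?)"
fi
```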

Available quantization types:

or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
or  IQ2_XXS :  2.06 bpw quantization
or  IQ2_XS  :  2.31 bpw quantization
or  IQ2_S   :  2.5  bpw quantization
or  IQ2_M   :  2.7  bpw quantization
or  IQ1_S   :  1.56 bpw quantization
or  IQ1_M   :  1.75 bpw quantization
or  TQ1_0   :  1.69 bpw ternarization
or  TQ2_0   :  2.06 bpw ternarization
or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
or  IQ3_XXS :  3.06 bpw quantization
or  IQ3_S   :  3.44 bpw quantization
or  IQ3_M   :  3.66 bpw quantization mix
or  Q3_K    : alias for Q3_K_M
or  IQ3_XS  :  3.3 bpw quantization
or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
or  IQ4_NL  :  4.50 bpw non-linear quantization
or  IQ4_XS  :  4.25 bpw non-linear quantization
or  Q4_K    : alias for Q4_K_M
or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
or  Q5_K    : alias for Q5_K_M
or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing
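
The list mixes two units: file sizes for a reference model and bits per weight (bpw). The two are related by size in bytes × 8 / parameter count. As an illustrative sketch, the hypothetical helper `bits_per_weight` below converts a size in GiB and a parameter count in billions into an effective bpw, using the 1.04 GiB and 1.78 B figures that llama-bench reports for the Q4_K_M example model in this guide.

```shell
# Effective bits per weight:
#   file size in GiB * 2^30 bytes * 8 bits / (parameters in billions * 1e9)
bits_per_weight() {
    awk -v gib="$1" -v params_b="$2" \
        'BEGIN { printf "%.2f\n", gib * 1073741824 * 8 / (params_b * 1e9) }'
}

bits_per_weight 1.04 1.78   # prints 5.02
```

The result is an average over the whole file, so it can differ from the nominal figure for a type when some tensors are stored at a different precision.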

Verify the model

Device

cd build/bin
./llama-cli -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

Sample model output:

> hi, who are you
<think>

</think>

Hi! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.

Benchmark the model

Device

./llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

radxa@orion-o6:~/llama.cpp/build/bin$ ./llama-bench -m ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -t 8
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1.5B Q4_K - Medium       |   1.04 GiB |     1.78 B | CPU        |       8 |         pp512 |         64.60 ± 0.27 |
| qwen2 1.5B Q4_K - Medium       |   1.04 GiB |     1.78 B | CPU        |       8 |         tg128 |         36.29 ± 0.16 |
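
In this table, pp512 is the prompt-processing test with a 512-token prompt and tg128 is the text-generation test for 128 tokens; both are reported in tokens per second. When comparing several runs, it can be handy to reduce the table to just the test name and throughput. The sketch below does that with awk. `bench_summary` is a hypothetical helper, and the column positions assume the CPU-backend layout shown above (other backends may add columns).

```shell
# Read llama-bench's Markdown table on stdin and print "<test> <tokens/s>"
# for each result row. Column positions assume the layout shown above.
bench_summary() {
    awk -F'|' '
        NF >= 8 {
            test_name = $7
            gsub(/^[ \t]+|[ \t]+$/, "", test_name)
            split($8, rate, " ")
            if (rate[1] ~ /^[0-9]+(\.[0-9]+)?$/)
                print test_name, rate[1]
        }'
}

# Usage (from build/bin):
#   ./llama-bench -m DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf | bench_summary
```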

References

For more details about llama.cpp, see the official documentation.
