ERNIE 4.5-0.3B

This document describes how to run Baidu's ERNIE-4.5-0.3B and ERNIE-4.5-0.3B-Base models on the Radxa Orion O6 / O6N using llama.cpp with KleidiAI acceleration enabled.

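Source models (the repositories used by the download commands later in this document): PaddlePaddle/ERNIE-4.5-0.3B-PT and PaddlePaddle/ERNIE-4.5-0.3B-Base-PT.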

Model Download

Radxa provides pre-quantized ERNIE-4.5-0.3B-PT-Q4_0.gguf and ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf models, which can be downloaded with modelscope.

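ERNIE-4.5-0.3B-PT

Device

pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-PT-Q4_0.gguf

ERNIE-4.5-0.3B-Base-PT

Device

pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf

Note: --local_dir names the target directory, so each file is saved inside a directory of the same name, for example ./ERNIE-4.5-0.3B-PT-Q4_0.gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf. Adjust the -m path in the inference commands to match where the file was saved.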
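As a quick sanity check after downloading, one can confirm that a file really is a GGUF file, since the format begins with the four ASCII bytes GGUF. The sketch below assumes a POSIX shell with head available; the function name is_gguf and the example path are illustrative and are not part of modelscope or llama.cpp.

```shell
# Return success if the file at $1 starts with the 4-byte GGUF magic.
# is_gguf is a hypothetical helper name, not a llama.cpp or modelscope tool.
is_gguf() {
    [ -f "$1" ] || return 1
    # Only the first four bytes are read, so large models are checked quickly.
    [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Example usage (the path is an assumption; adjust it to where the file was saved):
# is_gguf ./ERNIE-4.5-0.3B-PT-Q4_0.gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf && echo "looks like GGUF"
```

This only checks the header, not the integrity of the whole file, so a truncated download would still pass.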

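Model Conversion

Tip

If you are interested in converting the models to GGUF yourself, follow this section to perform the conversion on an x86 host. If you do not want to convert the models, download the GGUF models provided by Radxa and skip to the Model Inference section.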

Build llama.cpp

Build llama.cpp on the x86 host.

Tip

Follow the llama.cpp build documentation to build llama.cpp on the x86 host.

The build commands are as follows:

X86 PC

sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release
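The later steps in this document call build/bin/llama-quantize, llama-cli and llama-bench, so it can help to confirm the build produced them before continuing. The following is a minimal sketch; missing_tools is a hypothetical helper name and the directory argument assumes the build layout used by the commands above.

```shell
# Print the name of every tool that is missing or not executable in a directory.
# missing_tools is a hypothetical helper; it prints nothing if all tools are present.
missing_tools() {
    dir="$1"
    shift
    for tool in "$@"; do
        [ -x "$dir/$tool" ] || echo "$tool"
    done
}

# Example usage from the llama.cpp checkout:
# missing_tools ./build/bin llama-quantize llama-cli llama-bench
```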

Download the Models

Use modelscope to download the source models.

ERNIE-4.5-0.3B-PT

X86 PC

pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-PT --local_dir ./ERNIE-4.5-0.3B-PT

ERNIE-4.5-0.3B-Base-PT

X86 PC

pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-Base-PT --local_dir ./ERNIE-4.5-0.3B-Base-PT

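Convert to a Floating-Point GGUF Model

The commands below assume the source model directories are located inside the llama.cpp directory; otherwise pass the actual path to the model directory.

ERNIE-4.5-0.3B-PT

X86 PC

cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-PT

ERNIE-4.5-0.3B-Base-PT

X86 PC

cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-Base-PT

Running convert_hf_to_gguf.py writes an F16 floating-point GGUF model into the source model directory.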
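To judge whether the converted file has a plausible size, a rough estimate is 2 bytes per parameter for F16. The sketch below assumes every tensor is stored as F16 and ignores metadata, so the real file may be somewhat larger; f16_size_mib is a hypothetical helper, and the 360.75 M parameter count is the figure reported by llama-bench in the performance section of this document.

```shell
# Estimate the size in MiB of an F16 model at 2 bytes per parameter.
# $1 is the parameter count in millions; metadata overhead is ignored.
f16_size_mib() {
    LC_ALL=C awk -v mparams="$1" \
        'BEGIN { printf "%.2f\n", mparams * 1000000 * 2 / 1048576 }'
}

# Example: 360.75 M parameters, as reported by llama-bench in this document.
# f16_size_mib 360.75
```

For 360.75 M parameters this gives roughly 688 MiB, which can be compared against the output of ls -l on the generated file.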

Quantize the GGUF Model

Use the llama-quantize tool to quantize the floating-point GGUF model to Q4_0.

ERNIE-4.5-0.3B-PT

X86 PC

cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-F16.gguf ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-Q4_0.gguf Q4_0

ERNIE-4.5-0.3B-Base-PT

X86 PC

cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-Base-PT/ERNIE-4.5-0.3B-Base-PT-F16.gguf ERNIE-4.5-0.3B-Base-PT/ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf Q4_0

Running llama-quantize writes a GGUF model with the requested quantization type to the output path given on the command line.
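Q4_0 stores weights in blocks of 32, each holding one 16-bit scale and 32 four-bit values, which is 18 bytes per block or 4.5 bits per weight. The effective figure for a whole file can be worked out from its size and parameter count, as sketched below. bits_per_weight is a hypothetical helper, and the example numbers (219.68 MiB, 360.75 M) are taken from the llama-bench output in the performance section of this document.

```shell
# Compute the average number of bits stored per parameter.
# $1: file size in MiB, $2: parameter count in millions.
bits_per_weight() {
    LC_ALL=C awk -v mib="$1" -v mparams="$2" \
        'BEGIN { printf "%.2f\n", mib * 1048576 * 8 / (mparams * 1000000) }'
}

# Example with the figures reported by llama-bench in this document:
# bits_per_weight 219.68 360.75
```

For these figures the result is about 5.11 bits per weight, above the nominal 4.5. The gap presumably comes from tensors that are kept at higher precision and from file metadata, although this document does not break the size down.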

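Model Inference

Build llama.cpp

Tip

Follow the llama.cpp build documentation to build llama.cpp with the KleidiAI feature enabled on the Radxa Orion O6 / O6N.

The build commands are as follows: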

Device

sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
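The build above targets armv9-a with the i8mm and dotprod extensions, so it can be worth checking that the CPU reports them before building. The following is a minimal sketch that reads the Features line of /proc/cpuinfo; has_cpu_feature is a hypothetical helper, and it assumes the Linux arm64 naming in which dotprod appears as asimddp.

```shell
# Succeed if CPU feature $1 appears on the "Features" line of a cpuinfo file.
# $2 defaults to /proc/cpuinfo; has_cpu_feature is a hypothetical helper name.
has_cpu_feature() {
    grep -m1 -i '^Features' "${2:-/proc/cpuinfo}" \
        | tr ' \t' '\n\n' \
        | grep -xF "$1" > /dev/null
}

# Example usage on the device. On Linux arm64, dotprod is listed as "asimddp".
# for f in asimddp i8mm; do
#     has_cpu_feature "$f" && echo "$f: yes" || echo "$f: no"
# done
```

Only the first Features line is inspected, which assumes all cores report the same feature set.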

Run the Models

llama-cli is used here to chat with the models.

ERNIE-4.5-0.3B-PT

Device

cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7406-4aced7a63
model      : ERNIE-4.5-0.3B-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

Relativity is a philosophical and scientific theory that describes how the laws of physics are relative to different reference frames. It's a way of thinking and studying phenomena that treats the motion of objects as a coordinate in a three-dimensional space of spacetime, and it explains how frames of reference can be relative to each other.

[ Prompt: 224.0 t/s | Generation: 45.9 t/s ]

ERNIE-4.5-0.3B-Base-PT

Device

cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7406-4aced7a63
model      : ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

Relativity is the scientific theory that explains the laws of physics that govern the behavior of matter and energy in the universe. It is a theory that explains the nature of space and time, which has implications for our understanding of the physical world and the laws of nature. Relativity is a fundamental concept in physics that describes the relationship between the speed of light in a vacuum and the speed of light in a medium. It also explains the behavior of objects in general relativity, which deals with the force of gravity and the curvature of space and time in general.

[ Prompt: 365.2 t/s | Generation: 43.3 t/s ]
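The commands above pin llama-cli to a fixed CPU list with taskset and set -t to the same number of threads. When changing the list, a small helper can confirm that the thread count still matches. The sketch below is illustrative: count_cpus is a hypothetical name, and it accepts both comma-separated ids and lo-hi ranges, the two forms taskset -c understands.

```shell
# Count the CPUs named by a taskset-style list such as "0,5-11".
# count_cpus is a hypothetical helper; it does not validate its input.
count_cpus() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { n += 1 }               # single CPU id
        NF == 2 { n += $2 - $1 + 1 }     # inclusive range lo-hi
        END     { print n + 0 }'
}

# Example: the list used in this document names eight CPUs, matching -t 8.
# count_cpus 0,5,6,7,8,9,10,11
```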

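Performance Analysis

Use the llama-bench tool to benchmark the models. The commands below show the 128-token case; the tables also list results for 512 to 4096 tokens, presumably obtained by rerunning with larger -p, -n and -pg values.

ERNIE-4.5-0.3B-PT

Device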

taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8

Model	ernie4_5 0.3B Q4_0
Size	219.68 MiB
Params	360.75 M
Backend	CPU
Threads	8

n-prompt	n-gen	prefill (t/s)	generation (t/s)	prefill+generation (t/s)
128	128	393.12 ± 3.11	78.56 ± 0.89	130.87 ± 1.04
512	512	439.33 ± 7.26	77.05 ± 0.23	116.79 ± 0.43
1024	1024	374.82 ± 2.67	70.65 ± 0.22	90.95 ± 0.35
2048	2048	293.03 ± 1.38	58.21 ± 0.09	66.94 ± 0.10
4096	4096	206.78 ± 0.28	45.48 ± 0.11	44.76 ± 0.03

ERNIE-4.5-0.3B-Base-PT

Device

taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8

Model	ernie4_5 0.3B Base Q4_0
Size	219.68 MiB
Params	360.75 M
Backend	CPU
Threads	8

n-prompt	n-gen	prefill (t/s)	generation (t/s)	prefill+generation (t/s)
128	128	405.01 ± 5.66	75.12 ± 0.74	126.65 ± 0.96
512	512	445.61 ± 6.44	73.82 ± 0.22	114.13 ± 0.14
1024	1024	384.32 ± 1.54	68.78 ± 0.27	90.95 ± 0.07
2048	2048	300.07 ± 1.51	57.33 ± 0.06	67.82 ± 0.08
4096	4096	207.03 ± 0.70	44.82 ± 0.13	44.59 ± 0.02
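One way to read the tables is to compare the measured prefill+generation rate with a simple estimate: total tokens divided by the time to prefill the prompt plus the time to generate the output at the separately measured rates. The sketch below implements that estimate only; combined_tps is a hypothetical helper and is not how llama-bench computes its own figure.

```shell
# Estimate combined throughput from separate prefill and generation rates.
# $1: prompt tokens, $2: generated tokens, $3: prefill t/s, $4: generation t/s.
combined_tps() {
    LC_ALL=C awk -v np="$1" -v ng="$2" -v pp="$3" -v tg="$4" \
        'BEGIN { printf "%.2f\n", (np + ng) / (np / pp + ng / tg) }'
}

# Example with the 128-token row of the ERNIE-4.5-0.3B-PT table:
# combined_tps 128 128 393.12 78.56
```

For the 128-token row the estimate is about 130.95 t/s, close to the measured 130.87. For the 4096-token row it gives about 74.56 t/s against a measured 44.76, presumably because generation that follows a long prompt runs with a much deeper context than the standalone generation test.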
