ERNIE 4.5-0.3B
This document explains how to run the Baidu ERNIE-4.5-0.3B and ERNIE-4.5-0.3B-Base models on the Radxa Orion O6 / O6N using llama.cpp with KleidiAI acceleration.
Download the model
Radxa provides pre-built GGUF files: ERNIE-4.5-0.3B-PT-Q4_0.gguf and ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf.
You can download them with modelscope. Install modelscope once, then run the command for the model you need. Note that --local_dir names a directory, so each file is saved inside a directory of the same name; adjust the -m path in the inference commands accordingly.
pip3 install modelscope
ERNIE-4.5-0.3B-PT:
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-PT-Q4_0.gguf
ERNIE-4.5-0.3B-Base-PT:
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf
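A GGUF file begins with the four ASCII bytes GGUF, so a quick check after downloading can catch an obviously wrong file, such as a saved error page. The following is a minimal sketch; the function name is_gguf and the idea of saving it as a small script are illustrative assumptions, not part of the Radxa instructions.

```shell
# Return 0 if the file starts with the 4-byte GGUF magic ("GGUF" = 47 47 55 46 in hex).
is_gguf() {
    [ -f "$1" ] || return 1
    # Read the first 4 bytes as hex and strip spaces/newlines from od's output.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "47475546" ]
}

# Check every path given on the command line (does nothing when run without arguments).
for f in "$@"; do
    if is_gguf "$f"; then
        echo "$f: looks like a GGUF file"
    else
        echo "$f: not a GGUF file"
    fi
done
```

Pass the path of the downloaded .gguf file as the argument. This only inspects the header, so it does not prove the download is complete.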
Convert the model (optional)
If you want to convert the model to GGUF yourself, follow this section on an x86 host.
Otherwise, download the pre-built GGUF from Radxa and skip to Inference.
Build llama.cpp
Follow the llama.cpp build instructions to build llama.cpp on an x86 host.
Build command:
sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release
Download the source model
Use modelscope to download the original model. Run the download from inside the llama.cpp directory so that the relative paths in the conversion and quantization steps resolve.
pip3 install modelscope
ERNIE-4.5-0.3B-PT:
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-PT --local_dir ./ERNIE-4.5-0.3B-PT
ERNIE-4.5-0.3B-Base-PT:
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-Base-PT --local_dir ./ERNIE-4.5-0.3B-Base-PT
Convert to a float (F16) GGUF
ERNIE-4.5-0.3B-PT:
cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-PT
ERNIE-4.5-0.3B-Base-PT:
cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-Base-PT
Running convert_hf_to_gguf.py generates an F16 (float) GGUF file in the model directory.
Quantize the GGUF
Use llama-quantize to quantize the float GGUF to Q4_0.
ERNIE-4.5-0.3B-PT:
cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-F16.gguf ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-Q4_0.gguf Q4_0
ERNIE-4.5-0.3B-Base-PT:
cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-Base-PT/ERNIE-4.5-0.3B-Base-PT-F16.gguf ERNIE-4.5-0.3B-Base-PT/ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf Q4_0
Running llama-quantize generates a GGUF file with the selected quantization format in the target path.
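The convert and quantize steps can be chained for both models. The sketch below assumes the current directory is the llama.cpp checkout and that each source model sits in ./<name>. It passes --outtype f16 and --outfile to convert_hf_to_gguf.py so the F16 file lands at the path that the llama-quantize command expects, rather than relying on the script's default output name. The helper names run and convert_and_quantize and the DRY_RUN variable are illustrative; by default the script only prints the commands.

```shell
# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert one Hugging Face model directory to F16 GGUF, then quantize to Q4_0.
convert_and_quantize() {
    name=$1
    f16="$name/$name-F16.gguf"
    q4="$name/$name-Q4_0.gguf"
    run python3 convert_hf_to_gguf.py "./$name" --outtype f16 --outfile "$f16"
    run ./build/bin/llama-quantize "$f16" "$q4" Q4_0
}

for model in ERNIE-4.5-0.3B-PT ERNIE-4.5-0.3B-Base-PT; do
    convert_and_quantize "$model"
done
```

Set DRY_RUN=0 in the environment to run the commands instead of printing them.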
Inference
Build llama.cpp
Follow the llama.cpp build instructions to build llama.cpp with KleidiAI enabled on the Radxa Orion O6 / O6N.
Build command:
sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
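The build above sets GGML_NATIVE=OFF and names the target features (i8mm and dotprod) explicitly, so the compiler does not probe the host CPU. It can be worth confirming that the kernel reports those features before running the binary. The sketch below assumes the usual arm64 Linux names in the Features line of /proc/cpuinfo, asimddp for dotprod and i8mm for i8mm; the function name has_cpu_features is illustrative.

```shell
# Succeed if every feature named after the first argument appears as a whole
# word in the space-separated feature string given as the first argument.
has_cpu_features() {
    features=" $1 "
    shift
    for f in "$@"; do
        case "$features" in
            *" $f "*) ;;
            *) echo "missing: $f"; return 1 ;;
        esac
    done
    return 0
}

# On the board, take the first Features line from /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    line=$(grep -m1 '^Features' /proc/cpuinfo | cut -d: -f2)
    if has_cpu_features "$line" asimddp i8mm; then
        echo "dotprod and i8mm are reported by the kernel"
    else
        echo "required features not reported on this machine"
    fi
fi
```

On a non-Arm host the check simply reports that the features are missing.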
Run inference
Use llama-cli to chat with the model. Each command below is followed by a sample session.
ERNIE-4.5-0.3B-PT:
cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
Sample session:
(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
Loading model...
build : b7406-4aced7a63
model : ERNIE-4.5-0.3B-PT-Q4_0.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> What is relativity?
Relativity is a philosophical and scientific theory that describes how the laws of physics are relative to different reference frames. It's a way of thinking and studying phenomena that treats the motion of objects as a coordinate in a three-dimensional space of spacetime, and it explains how frames of reference can be relative to each other.
[ Prompt: 224.0 t/s | Generation: 45.9 t/s ]
ERNIE-4.5-0.3B-Base-PT:
cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
Sample session:
(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
Loading model...
build : b7406-4aced7a63
model : ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> What is relativity?
Relativity is the scientific theory that explains the laws of physics that govern the behavior of matter and energy in the universe. It is a theory that explains the nature of space and time, which has implications for our understanding of the physical world and the laws of nature. Relativity is a fundamental concept in physics that describes the relationship between the speed of light in a vacuum and the speed of light in a medium. It also explains the behavior of objects in general relativity, which deals with the force of gravity and the curvature of space and time in general.
[ Prompt: 365.2 t/s | Generation: 43.3 t/s ]
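Both commands above pin llama-cli to eight CPUs with taskset -c and pass -t 8, so each thread has a CPU to itself. If you change the CPU list, the thread count should follow it. The sketch below derives -t from a taskset-style list; count_cpus, CPU_LIST, and THREADS are illustrative names, and the script only prints the resulting command. Whether a different CPU list performs better on your board is not something this document establishes.

```shell
# Count the CPUs named in a taskset-style list such as "0,5-7,11".
count_cpus() {
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case "$part" in
            # A range "a-b" contributes b - a + 1 CPUs.
            *-*) lo=${part%-*}; hi=${part#*-}; total=$((total + hi - lo + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

CPU_LIST="0,5,6,7,8,9,10,11"
THREADS=$(count_cpus "$CPU_LIST")
echo "taskset -c $CPU_LIST ./build/bin/llama-cli -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t $THREADS --conversation --jinja"
```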
Performance benchmarking
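You can use llama-bench to benchmark the model. The commands below assume the current directory is build/bin and show the 128-token case; the other table rows presumably come from repeating the run with larger -p, -n, and -pg values.
ERNIE-4.5-0.3B-PT:
taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8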
| Model | ernie4_5 0.3B Q4_0 |
|---|---|
| Size | 219.68 MiB |
| params | 360.75 M |
| backend | CPU |
| threads | 8 |
| n-prompt | n-gen | prefill t/s | generation t/s | prefill+generation t/s |
|---|---|---|---|---|
| 128 | 128 | 393.12 ± 3.11 | 78.56 ± 0.89 | 130.87 ± 1.04 |
| 512 | 512 | 439.33 ± 7.26 | 77.05 ± 0.23 | 116.79 ± 0.43 |
| 1024 | 1024 | 374.82 ± 2.67 | 70.65 ± 0.22 | 90.95 ± 0.35 |
| 2048 | 2048 | 293.03 ± 1.38 | 58.21 ± 0.09 | 66.94 ± 0.10 |
| 4096 | 4096 | 206.78 ± 0.28 | 45.48 ± 0.11 | 44.76 ± 0.03 |
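The results table lists five sequence lengths, while the command shown covers only 128 tokens. A minimal sketch for generating the full sweep follows, assuming each row was produced with matching -p, -n, and -pg values (the source does not state this). The function name bench_cmds is illustrative, and the script prints the commands rather than running them.

```shell
# Print one llama-bench command per sequence length for the given model file.
# Assumption: each table row corresponds to equal prompt and generation lengths.
bench_cmds() {
    model=$1
    for n in 128 512 1024 2048 4096; do
        echo "taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m $model -p $n -n $n -pg $n,$n -t 8"
    done
}

bench_cmds ERNIE-4.5-0.3B-PT-Q4_0.gguf
bench_cmds ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf
```

Pipe the output to sh from the directory that contains llama-bench to execute the sweep.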
ERNIE-4.5-0.3B-Base-PT:
taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8
| Model | ernie4_5 0.3B Base Q4_0 |
|---|---|
| Size | 219.68 MiB |
| params | 360.75 M |
| backend | CPU |
| threads | 8 |
| n-prompt | n-gen | prefill t/s | generation t/s | prefill+generation t/s |
|---|---|---|---|---|
| 128 | 128 | 405.01 ± 5.66 | 75.12 ± 0.74 | 126.65 ± 0.96 |
| 512 | 512 | 445.61 ± 6.44 | 73.82 ± 0.22 | 114.13 ± 0.14 |
| 1024 | 1024 | 384.32 ± 1.54 | 68.78 ± 0.27 | 90.95 ± 0.07 |
| 2048 | 2048 | 300.07 ± 1.51 | 57.33 ± 0.06 | 67.82 ± 0.08 |
| 4096 | 4096 | 207.03 ± 0.70 | 44.82 ± 0.13 | 44.59 ± 0.02 |
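The Size and params rows can be combined into an effective bits-per-weight figure, which is a quick way to sanity-check a quantized file. The sketch below uses the values from the tables above (219.68 MiB, 360.75 M parameters); the function name bits_per_weight is illustrative. Q4_0 stores 32 weights in 18 bytes, a nominal 4.5 bits per weight, so a higher result suggests that some tensors are kept at higher precision, although the tables alone do not show which ones.

```shell
# Effective bits per weight = file size in bits / parameter count.
# $1: size in MiB, $2: parameter count in millions.
bits_per_weight() {
    awk -v mib="$1" -v mparams="$2" \
        'BEGIN { printf "%.2f\n", (mib * 1024 * 1024 * 8) / (mparams * 1000000) }'
}

bits_per_weight 219.68 360.75   # prints 5.11
```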