ERNIE 4.5-0.3B

This document explains how to run two Baidu ERNIE models, ERNIE-4.5-0.3B and ERNIE-4.5-0.3B-Base, on the Radxa Orion O6 / O6N using llama.cpp with KleidiAI acceleration.

Model links:

Download the model

Radxa provides pre-built GGUF files: ERNIE-4.5-0.3B-PT-Q4_0.gguf and ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf. You can download them with modelscope:

ERNIE-4.5-0.3B-PT

Device

pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-PT-Q4_0.gguf

ERNIE-4.5-0.3B-Base-PT

Device

pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf
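After downloading, it can be worth confirming that the file really is a GGUF file before building anything. A minimal sketch, relying only on the fact that GGUF files begin with the four ASCII bytes "GGUF". The helper name is_gguf is hypothetical, and the path is an assumption: it presumes modelscope placed the file inside the directory passed to --local_dir.

```shell
# Return success if the file exists and starts with the 4-byte GGUF magic.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Assumed location: <local_dir>/<file name>. Adjust to your actual path.
model="./ERNIE-4.5-0.3B-PT-Q4_0.gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf"

if is_gguf "$model"; then
    echo "looks like a GGUF file: $model"
else
    echo "missing or not a GGUF file: $model"
fi
```

This only checks the header, so a truncated download would still pass. Comparing the file size against the model page is a reasonable second check.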

Convert the model (optional)

Tip: If you want to convert the model to GGUF yourself, follow this section on an x86 host. Otherwise, download the pre-built GGUF from Radxa and skip to Inference.

Build llama.cpp

Build llama.cpp on an x86 host.

Tip: Follow the llama.cpp build instructions to build llama.cpp on an x86 host.

Build command:

X86 PC

sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release

Download the source model

Use modelscope to download the original model:

ERNIE-4.5-0.3B-PT

X86 PC

pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-PT --local_dir ./ERNIE-4.5-0.3B-PT

ERNIE-4.5-0.3B-Base-PT

X86 PC

pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-Base-PT --local_dir ./ERNIE-4.5-0.3B-Base-PT

Convert to a float (F16) GGUF

ERNIE-4.5-0.3B-PT

X86 PC

cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-PT

ERNIE-4.5-0.3B-Base-PT

X86 PC

cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-Base-PT

Running convert_hf_to_gguf.py generates an F16 (float) GGUF file in the model directory.
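The quantize step that follows expects the F16 file at <model directory>/<model name>-F16.gguf. One way to avoid depending on the script's default naming is to pass the output type and path explicitly with --outtype and --outfile. A minimal sketch, assuming the model directory sits where the commands above put it. The helper name f16_path is hypothetical.

```shell
# Build the F16 output path used by the quantize step:
#   <model_dir>/<model_name>-F16.gguf
f16_path() {
    dir="${1%/}"    # strip a trailing slash, if any
    printf '%s/%s-F16.gguf\n' "$dir" "$(basename "$dir")"
}

model_dir="./ERNIE-4.5-0.3B-PT"
out="$(f16_path "$model_dir")"
echo "F16 output path: $out"

# Convert only when the script and the model directory are both present.
if [ -f convert_hf_to_gguf.py ] && [ -d "$model_dir" ]; then
    python3 convert_hf_to_gguf.py "$model_dir" --outtype f16 --outfile "$out"
fi
```

For the Base model, set model_dir to ./ERNIE-4.5-0.3B-Base-PT and the same helper yields the matching path.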

Quantize the GGUF

Use llama-quantize to quantize the float GGUF to Q4_0:

ERNIE-4.5-0.3B-PT

X86 PC

cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-F16.gguf ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-Q4_0.gguf Q4_0

ERNIE-4.5-0.3B-Base-PT

X86 PC

cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-Base-PT/ERNIE-4.5-0.3B-Base-PT-F16.gguf ERNIE-4.5-0.3B-Base-PT/ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf Q4_0

Running llama-quantize generates a GGUF file with the selected quantization format in the target path.
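A quick sanity check on a quantized file is its effective bits per weight: file size in bits divided by parameter count. Q4_0 stores each block of 32 weights in 18 bytes, which is nominally 4.5 bits per weight. The sketch below applies the formula to the size (219.68 MiB) and parameter count (360.75 M) reported in the benchmark section of this document. The helper name effective_bpw is hypothetical.

```shell
# Effective bits per weight = file size in bits / parameter count.
# $1: file size in MiB, $2: parameter count in millions.
effective_bpw() {
    awk -v mib="$1" -v m="$2" \
        'BEGIN { printf "%.2f\n", mib * 1048576 * 8 / (m * 1000000) }'
}

effective_bpw 219.68 360.75   # prints 5.11
```

The result is somewhat above the nominal 4.5, presumably because not every tensor in the file is stored as Q4_0 and the file also carries metadata. That explanation is an assumption, not something the source states.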

Inference

Build llama.cpp

Tip: Follow the llama.cpp build instructions to build llama.cpp with KleidiAI enabled on the Radxa Orion O6 / O6N.

Build command:

Device

sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
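The build flags above target armv9-a with the i8mm and dotprod extensions, so it can help to confirm that the kernel reports both before building. A minimal sketch, assuming a Linux AArch64 system where /proc/cpuinfo has a Features line. Linux reports dotprod under the name asimddp. The helper name has_feature is hypothetical.

```shell
# Return success if the space-separated feature list in $1 contains flag $2.
has_feature() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Take the first "Features" line from /proc/cpuinfo (empty on other platforms).
features="$(grep -m1 '^Features' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"

for flag in asimddp i8mm; do
    if has_feature "$features" "$flag"; then
        echo "$flag: present"
    else
        echo "$flag: not reported"
    fi
done
```

On an x86 host both flags show as not reported, which is expected because the Features line only exists on Arm kernels.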

Run inference

Use llama-cli to chat with the model:

ERNIE-4.5-0.3B-PT

Device

cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7406-4aced7a63
model      : ERNIE-4.5-0.3B-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

Relativity is a philosophical and scientific theory that describes how the laws of physics are relative to different reference frames. It's a way of thinking and studying phenomena that treats the motion of objects as a coordinate in a three-dimensional space of spacetime, and it explains how frames of reference can be relative to each other.

[ Prompt: 224.0 t/s | Generation: 45.9 t/s ]

ERNIE-4.5-0.3B-Base-PT

Device

cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7406-4aced7a63
model      : ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

Relativity is the scientific theory that explains the laws of physics that govern the behavior of matter and energy in the universe. It is a theory that explains the nature of space and time, which has implications for our understanding of the physical world and the laws of nature. Relativity is a fundamental concept in physics that describes the relationship between the speed of light in a vacuum and the speed of light in a medium. It also explains the behavior of objects in general relativity, which deals with the force of gravity and the curvature of space and time in general.

[ Prompt: 365.2 t/s | Generation: 43.3 t/s ]
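Both commands above pin llama-cli to eight CPUs with taskset and pass -t 8, so the two values need to stay in step if you change the CPU list. A minimal sketch that counts the CPUs in a taskset-style list and compares the count with the thread setting. The helper name count_cpus is hypothetical, and it handles only plain IDs and simple ranges such as 5-11, not the stride syntax taskset also accepts.

```shell
# Count the CPUs named by a taskset-style list such as "0,5-7,11".
count_cpus() {
    total=0
    old_ifs="$IFS"; IFS=','
    for part in $1; do
        case "$part" in
            # A range "a-b" contributes b - a + 1 CPUs.
            *-*) total=$((total + ${part#*-} - ${part%-*} + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS="$old_ifs"
    echo "$total"
}

cpus="0,5,6,7,8,9,10,11"
threads=8

if [ "$(count_cpus "$cpus")" -eq "$threads" ]; then
    echo "CPU list and -t agree: $threads"
else
    echo "mismatch: $(count_cpus "$cpus") CPUs pinned, -t $threads"
fi
```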

Performance benchmarking

You can use llama-bench to benchmark the model.

ERNIE-4.5-0.3B-PT

Device

taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8

Model	ernie4_5 0.3B Q4_0
Size	219.68 MiB
params	360.75 M
backend	CPU
threads	8

n-prompt	n-gen	prefill t/s	generation t/s	prefill+generation t/s
128	128	393.12 ± 3.11	78.56 ± 0.89	130.87 ± 1.04
512	512	439.33 ± 7.26	77.05 ± 0.23	116.79 ± 0.43
1024	1024	374.82 ± 2.67	70.65 ± 0.22	90.95 ± 0.35
2048	2048	293.03 ± 1.38	58.21 ± 0.09	66.94 ± 0.10
4096	4096	206.78 ± 0.28	45.48 ± 0.11	44.76 ± 0.03

ERNIE-4.5-0.3B-Base-PT

Device

taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8

Model	ernie4_5 0.3B Base Q4_0
Size	219.68 MiB
params	360.75 M
backend	CPU
threads	8

n-prompt	n-gen	prefill t/s	generation t/s	prefill+generation t/s
128	128	405.01 ± 5.66	75.12 ± 0.74	126.65 ± 0.96
512	512	445.61 ± 6.44	73.82 ± 0.22	114.13 ± 0.14
1024	1024	384.32 ± 1.54	68.78 ± 0.27	90.95 ± 0.07
2048	2048	300.07 ± 1.51	57.33 ± 0.06	67.82 ± 0.08
4096	4096	207.03 ± 0.70	44.82 ± 0.13	44.59 ± 0.02
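The commands above show only the 128-token case, while the tables list sizes from 128 to 4096. The rows for the larger sizes presumably came from equivalent runs with larger values, which is an assumption. A minimal sketch of covering all sizes in one invocation: llama-bench accepts comma-separated lists for -p and -n, and -pg can be repeated. The helper name bench_args and the binary and model paths are hypothetical.

```shell
# Build llama-bench arguments for a list of sizes:
# comma-separated lists for -p and -n, plus one -pg pair per size.
bench_args() {
    list=""; pg=""
    for s in "$@"; do
        list="${list:+$list,}$s"
        pg="$pg -pg $s,$s"
    done
    echo "-p $list -n $list$pg -t 8"
}

args="$(bench_args 128 512 1024 2048 4096)"
echo "llama-bench arguments: $args"

# Run only if the binary and model are present (both paths are assumptions).
bench=./build/bin/llama-bench
model=ERNIE-4.5-0.3B-PT-Q4_0.gguf
if [ -x "$bench" ] && [ -f "$model" ]; then
    # $args is intentionally unquoted so it splits into separate arguments.
    taskset -c 0,5,6,7,8,9,10,11 "$bench" -m "$model" $args
fi
```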
