ERNIE 4.5-0.3B

This document explains how to run the Baidu ERNIE-4.5-0.3B and ERNIE-4.5-0.3B-Base models on the Radxa Orion O6 / O6N using llama.cpp with KleidiAI acceleration.

Model links:

Download the model

Radxa provides pre-built GGUF files: ERNIE-4.5-0.3B-PT-Q4_0.gguf and ERNIE-4.5-0.3B-Base-PT-Q4_0.gguf. You can download them with modelscope:

Device
pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-0.3B-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-0.3B-PT-Q4_0.gguf
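
Before moving on, it can help to confirm that the download really is a GGUF file. GGUF files begin with the four ASCII bytes GGUF, so a quick check is possible with standard tools. The sketch below is only an illustration: is_gguf is a hypothetical helper name, and the path assumes modelscope placed the file inside the --local_dir given above.

```shell
# Hypothetical helper: succeeds if the file exists and starts with the 4-byte GGUF magic.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Assumed location: the file inside the --local_dir from the download command.
MODEL=./ERNIE-4.5-0.3B-PT-Q4_0.gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf
if is_gguf "$MODEL"; then
    echo "OK: $MODEL looks like a GGUF file"
else
    echo "Missing or not a GGUF file: $MODEL"
fi
```

If the check fails, the download may have been interrupted or the file may be in a different directory than assumed here.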

Convert the model (optional)

tip

If you want to convert the model to GGUF yourself, follow this section on an x86 host.

Otherwise, download the pre-built GGUF from Radxa and skip to Inference.

Build llama.cpp

Build llama.cpp on an x86 host.

tip

See the upstream llama.cpp build instructions for full details on building llama.cpp on an x86 host.

Build command:

X86 PC
sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release

Download the source model

Use modelscope to download the original model:

X86 PC
pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-0.3B-PT --local_dir ./ERNIE-4.5-0.3B-PT

Convert to a float (F16) GGUF

X86 PC
cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-0.3B-PT

Running convert_hf_to_gguf.py generates an F16 (float) GGUF file in the model directory.

Quantize the GGUF

Use llama-quantize to quantize the float GGUF to Q4_0:

X86 PC
cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-F16.gguf ERNIE-4.5-0.3B-PT/ERNIE-4.5-0.3B-PT-Q4_0.gguf Q4_0

Running llama-quantize generates a GGUF file with the selected quantization format in the target path.
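
As a rough sanity check on the quantized file, the expected size can be estimated from the parameter count. Q4_0 stores weights in blocks of 32, each block taking 18 bytes (16 bytes of 4-bit values plus a 2-byte scale), which works out to 4.5 bits per weight. The sketch below uses the 360.75 M parameter count that llama-bench reports later in this document; q4_0_size_mib is a hypothetical helper name.

```shell
# Hypothetical helper: estimate the size in MiB of N parameters stored as Q4_0.
# Q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit values + a 2-byte scale).
q4_0_size_mib() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", n * 18 / 32 / (1024 * 1024) }'
}

# Parameter count taken from the llama-bench results (360.75 M).
q4_0_size_mib 360750000
```

The estimate is a lower bound of roughly 194 MiB, while llama-bench reports 219.68 MiB for the actual file. The gap is presumably because not every tensor is stored as Q4_0 and the file also carries metadata, so a file somewhat larger than the estimate is expected.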

Inference

Build llama.cpp

tip

Follow the upstream llama.cpp build instructions, with KleidiAI enabled, to build llama.cpp on the Radxa Orion O6 / O6N.

Build command:

Device
sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
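
The build above targets armv9-a with the i8mm and dotprod extensions, so the resulting binaries assume the CPU supports them. On arm64 Linux these appear in the Features line of /proc/cpuinfo, where dotprod is listed as asimddp. The following is a minimal sketch for checking this on the device; has_feature is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the whitespace-separated list $1 contains feature $2.
has_feature() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the first "Features" line; it stays empty on systems that do not expose one.
features=$(grep -m1 '^Features' /proc/cpuinfo 2>/dev/null | cut -d: -f2 || true)
for f in asimddp i8mm; do
    if has_feature "$features" "$f"; then
        echo "$f: present"
    else
        echo "$f: missing"
    fi
done
```

If either feature is reported as missing, the -DGGML_CPU_ARM_ARCH value probably needs adjusting for that machine.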

Run inference

Use llama-cli to chat with the model. taskset pins the process to the eight listed CPU cores, matching the -t 8 thread count:

Device
cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀

build : b7406-4aced7a63
model : ERNIE-4.5-0.3B-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

Relativity is a philosophical and scientific theory that describes how the laws of physics are relative to different reference frames. It's a way of thinking and studying phenomena that treats the motion of objects as a coordinate in a three-dimensional space of spacetime, and it explains how frames of reference can be relative to each other.

[ Prompt: 224.0 t/s | Generation: 45.9 t/s ]
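
The command above passes eight cores to taskset and eight threads to -t. If you change the core list, the thread count should normally change with it. The sketch below derives one from the other; count_cpus is a hypothetical helper, and it only handles plain comma-separated lists, not ranges such as 4-7.

```shell
# Hypothetical helper: count the CPUs in a comma-separated list such as "0,5,6,7".
# Ranges like "4-7" are not handled in this sketch.
count_cpus() {
    echo "$1" | tr ',' '\n' | grep -c .
}

CORES=0,5,6,7,8,9,10,11
THREADS=$(count_cpus "$CORES")

# Print the resulting command instead of running it, so it can be reviewed first.
echo "taskset -c $CORES ./build/bin/llama-cli -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -c 4096 -t $THREADS --conversation --jinja"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; copy the printed line into the shell on the device to use it.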

Performance benchmarking

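You can use llama-bench to benchmark the model. The command below measures a 128-token prompt and 128 generated tokens; the larger rows in the results were presumably produced by rerunning with -p, -n, and -pg set to the listed sizes.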

Device
taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-0.3B-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8
Model:    ernie4_5 0.3B Q4_0
Size:     219.68 MiB
Params:   360.75 M
Backend:  CPU
Threads:  8

n-prompt  n-gen  prefill t/s     generation t/s  prefill+generation t/s
128       128    393.12 ± 3.11   78.56 ± 0.89    130.87 ± 1.04
512       512    439.33 ± 7.26   77.05 ± 0.23    116.79 ± 0.43
1024      1024   374.82 ± 2.67   70.65 ± 0.22    90.95 ± 0.35
2048      2048   293.03 ± 1.38   58.21 ± 0.09    66.94 ± 0.10
4096      4096   206.78 ± 0.28   45.48 ± 0.11    44.76 ± 0.03
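
To relate the three throughput columns, the combined rate can be estimated from the separate prefill and generation rates: total tokens divided by the time spent on each phase. The sketch below is a simple model, not how llama-bench computes its figure; combined_tps is a hypothetical helper name.

```shell
# Hypothetical helper: estimate combined tokens/s from separate prefill and generation rates.
# Arguments: prompt tokens, generated tokens, prefill t/s, generation t/s.
combined_tps() {
    awk -v p="$1" -v n="$2" -v pp="$3" -v tg="$4" \
        'BEGIN { printf "%.2f\n", (p + n) / (p / pp + n / tg) }'
}

combined_tps 128 128 393.12 78.56     # measured combined rate: 130.87
combined_tps 4096 4096 206.78 45.48   # measured combined rate: 44.76
```

For the 128-token row the estimate lands close to the measured value. For the 4096-token row it comes out well above the measured 44.76 t/s, presumably because in the combined test generation runs on top of an already long context, which is slower than generating from an empty one.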

    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0