ERNIE 4.5-21B-A3B

This document explains how to run two Baidu ERNIE models, ERNIE-4.5-21B-A3B and ERNIE-4.5-21B-A3B-Base, on the Radxa Orion O6 / O6N using llama.cpp with KleidiAI acceleration.

Model links:

Download the model

Radxa provides pre-built GGUF files: ERNIE-4.5-21B-A3B-PT-Q4_0.gguf and ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf. You can download them with modelscope:

ERNIE-4.5-21B-A3B-PT

Device

pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-21B-A3B-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-21B-A3B-PT-Q4_0.gguf

ERNIE-4.5-21B-A3B-Base-PT

Device

pip3 install modelscope
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf
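Before moving on, it can be worth confirming that what was downloaded is a GGUF file rather than an HTML error page or an empty file. Every GGUF file begins with the four ASCII bytes `GGUF`, so a minimal sanity check only needs to read the first four bytes. The helper name `is_gguf` is hypothetical, and the check does not prove the file is complete.

```shell
# is_gguf is a hypothetical helper, not part of llama.cpp or modelscope.
# It succeeds when the path is a regular file that starts with the
# 4-byte GGUF magic; it does not verify that the download is complete.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Example (replace the placeholder path with the real location of the file):
# is_gguf path/to/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf && echo "GGUF magic found"
```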

Convert the model (optional)

tip

If you want to convert the model to GGUF yourself, follow this section on an x86 host.

Otherwise, download the pre-built GGUF from Radxa and skip to Inference.

Build llama.cpp

Build llama.cpp on an x86 host.

tip

Follow the llama.cpp build documentation to build llama.cpp on an x86 host.

Build command:

X86 PC

sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release
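The later conversion steps call `./build/bin/llama-quantize`, so it can help to confirm the build actually produced it before downloading a large model. The sketch below uses a hypothetical helper, `missing_tools`, that prints the name of every expected tool that is not an executable file in a given directory.

```shell
# missing_tools is a hypothetical helper: it prints every tool name
# that is not an executable file inside the given directory.
missing_tools() {
    dir="$1"; shift
    for tool in "$@"; do
        [ -x "$dir/$tool" ] || echo "$tool"
    done
}

# Example, run from the llama.cpp directory; no output means nothing is missing.
# missing_tools ./build/bin llama-quantize
```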

Download the source model

Use modelscope to download the original model:

ERNIE-4.5-21B-A3B-PT

X86 PC

pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-21B-A3B-PT --local_dir ./ERNIE-4.5-21B-A3B-PT

ERNIE-4.5-21B-A3B-Base-PT

X86 PC

pip3 install modelscope
modelscope download --model PaddlePaddle/ERNIE-4.5-21B-A3B-Base-PT --local_dir ./ERNIE-4.5-21B-A3B-Base-PT

Convert to a float (F16) GGUF

ERNIE-4.5-21B-A3B-PT

X86 PC

cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-21B-A3B-PT

ERNIE-4.5-21B-A3B-Base-PT

X86 PC

cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-21B-A3B-Base-PT

Running convert_hf_to_gguf.py generates an F16 (float) GGUF file in the model directory.

Quantize the GGUF

Use llama-quantize to quantize the float GGUF to Q4_0:

ERNIE-4.5-21B-A3B-PT

X86 PC

cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-21B-A3B-PT/ERNIE-4.5-21B-A3B-PT-F16.gguf ERNIE-4.5-21B-A3B-PT/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf Q4_0

ERNIE-4.5-21B-A3B-Base-PT

X86 PC

cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-21B-A3B-Base-PT/ERNIE-4.5-21B-A3B-Base-PT-F16.gguf ERNIE-4.5-21B-A3B-Base-PT/ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf Q4_0

Running llama-quantize generates a GGUF file with the selected quantization format in the target path.
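As a rough cross-check on the size of the quantized file: Q4_0 stores weights in blocks of 32, each block holding 16 bytes of 4-bit values plus a 2-byte FP16 scale, which works out to 18 bytes per 32 weights (4.5 bits per weight). The sketch below applies that ratio to the 21.83 B parameter count that llama-bench reports in the benchmark section of this document. The helper name `q4_0_gib` is hypothetical, and the estimate assumes every tensor is stored as Q4_0, which is a simplification.

```shell
# q4_0_gib is a hypothetical helper: estimate the Q4_0 file size in GiB
# for a parameter count given in billions, assuming 18 bytes per 32 weights
# and ignoring metadata and any tensors kept at higher precision.
q4_0_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b * 1e9 * 18 / 32 / (1024 * 1024 * 1024) }'
}

q4_0_gib 21.83
```

For 21.83 B parameters this gives roughly 11.4 GiB, in the same range as the 11.51 GiB that llama-bench reports. The small gap is plausibly explained by file metadata and tensors that are not stored as Q4_0.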

Inference

Build llama.cpp

tip

Follow the llama.cpp build documentation to build llama.cpp with KleidiAI enabled on the Radxa Orion O6 / O6N.

Build command:

Device

sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
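The `-DGGML_CPU_ARM_ARCH` value above names the `i8mm` and `dotprod` extensions, so it can be useful to confirm that the kernel reports them on the device. On arm64 Linux the `Features` line of `/proc/cpuinfo` lists them as `i8mm` and `asimddp` respectively. The helper below is a hypothetical sketch; it takes the cpuinfo path as an optional second argument so it can also be tried against a saved copy.

```shell
# has_cpu_feature is a hypothetical helper: succeed when the named flag
# appears as a whole word on the first "Features" line of the cpuinfo file.
has_cpu_feature() {
    grep -m1 '^Features' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

# asimddp is the name the arm64 kernel uses for the dot-product extension.
for flag in i8mm asimddp; do
    if has_cpu_feature "$flag"; then
        echo "$flag: present"
    else
        echo "$flag: not reported"
    fi
done
```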

Run inference

Use llama-cli to chat with the model:

ERNIE-4.5-21B-A3B-PT

Device

cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-21B-A3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7406-4aced7a63
model      : ERNIE-4.5-21B-A3B-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

 **Relativity** is a foundational theory in physics developed by Albert Einstein, primarily consisting of two parts: **special relativity** (1905) and **general relativity** (1915). It revolutionized our understanding of space, time, and gravity, challenging classical Newtonian physics.

### **1. Special Relativity**
- **Key Idea**: Physics laws are the same for all non-accelerating observers, regardless of their motion.
- **Postulates**:
  1. **Principle of Relativity**: Physical laws are identical in all inertial frames.
  2. **Speed of Light**: The speed of light in a vacuum (*c* ≈ 299,792 km/s) is constant and does not depend on the motion of the light source or observer.
- **Consequences**:
  - **Time Dilation**: Time slows down for objects moving at relativistic speeds (close to *c*). For example, a clock on a fast-moving train ticks slower than one on Earth.
  - **Length Contraction**: Objects appear shorter along the direction of motion when moving at high speeds.
  - **Mass-Energy Equivalence**: *E = mc²*—energy (*E*) and mass (*m*) are interchangeable, explaining nuclear reactions.
  - **Relativistic Momentum**: Momentum depends on velocity, not just speed.

### **2. General Relativity**
- **Key Idea**: Gravity is not a force but the curvature of spacetime caused by mass and energy.
- **Postulates**:
  - **Equivalence Principle**: A local inertial frame (free-falling) is indistinguishable from one without gravity.
  - **Spacetime Curvature**: Massive objects like planets warp spacetime, causing objects to follow curved paths (e.g., orbits).
- **Consequences**:
  - **Gravitational Time Dilation**: Clocks run slower in stronger gravitational fields (e.g., near Earth’s surface vs. orbit).
  - **Light Bending**: Light curves around massive objects due to spacetime curvature (confirmed by Eddington’s 1919 eclipse experiment).
  - **Black Holes**: Extreme curvature traps light and matter, creating regions where nothing escapes.
  - **Expanding Universe**: General relativity explains the universe’s expansion, leading to the Big Bang theory.

### **Applications and Impact**
- **Technology**: GPS systems rely on corrections for both special relativity (time dilation) and general relativity (gravity’s effect on time).
- **Cosmology**: Predicts black holes, neutron stars, and the universe’s evolution.
- **Fundamental Physics**: Unifies with quantum mechanics in attempts to explain the universe’s origin (e.g., string theory, loop quantum gravity).

### **Why It Matters**
Relativity reshaped modern physics by showing that space, time, and gravity are interconnected. It replaced Newton’s absolute space and time with a dynamic, relative framework, providing a more accurate description of the cosmos at both microscopic and cosmic scales.

In short, relativity is the science of how space, time, and energy influence each other, reshaping our understanding of reality.

[ Prompt: 18.6 t/s | Generation: 7.0 t/s ]

ERNIE-4.5-21B-A3B-Base-PT

Device

cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b7406-4aced7a63
model      : ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is relativity?

Relativity is a term used in physics to describe the theory of relativity, which was developed by Albert Einstein in the early 20th century. The theory of relativity is based on the idea that the laws of physics are the same for all observers, regardless of their relative motion. This means that the laws of physics, such as the laws of motion and the laws of gravity, apply equally to all observers, whether they are stationary or moving at a constant velocity.

The theory of relativity has two main branches: special relativity and general relativity. Special relativity deals with the behavior of objects moving at constant velocities, while general relativity deals with the behavior of objects in the presence of gravity.

The theory of relativity has had a profound impact on our understanding of the universe, including the discovery of black holes, the expansion of the universe, and the existence of gravitational waves. It has also led to the development of new technologies, such as GPS, which rely on the principles of relativity to function accurately.

[ Prompt: 18.5 t/s | Generation: 7.6 t/s ]
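Both commands above restrict llama-cli to eight cores with `taskset -c` and pass a matching `-t 8`, so the number of worker threads equals the number of allowed cores. If you change the core list, the thread count should change with it. The sketch below derives the count from a taskset-style list; `count_cpus` is a hypothetical helper name, and it assumes the list contains only plain numbers and `a-b` ranges.

```shell
# count_cpus is a hypothetical helper: count the CPUs named in a
# taskset-style list such as "0,5,6,7" or "0-3,8".
count_cpus() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F- '{ n += (NF == 2) ? $2 - $1 + 1 : 1 } END { print n + 0 }'
}

cores="0,5,6,7,8,9,10,11"
threads="$(count_cpus "$cores")"
echo "cores=$cores threads=$threads"
```

With the core list used in this document the script reports `threads=8`, which matches the `-t 8` passed to llama-cli.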

Performance benchmarking

You can use llama-bench to benchmark the model.

ERNIE-4.5-21B-A3B-PT

Device

taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-21B-A3B-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8

Model	ernie4_5-moe 21B.A3B Q4_0
Size	11.51 GiB
params	21.83 B
backend	CPU
threads	8

n-prompt	n-gen	prefill t/s	generation t/s	prefill+generation t/s
128	128	33.96 ± 0.35	9.75 ± 0.01	15.15 ± 0.04
512	512	36.30 ± 0.11	9.67 ± 0.02	14.69 ± 0.01
1024	1024	35.25 ± 0.04	9.38 ± 0.01	13.76 ± 0.01
2048	2048	33.59 ± 0.06	8.89 ± 0.01	12.28 ± 0.01
4096	4096	30.79 ± 0.02	8.15 ± 0.02	10.21 ± 0.02

ERNIE-4.5-21B-A3B-Base-PT

Device

taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8

Model	ernie4_5-moe 21B.A3B Base Q4_0
Size	11.51 GiB
params	21.83 B
backend	CPU
threads	8

n-prompt	n-gen	prefill t/s	generation t/s	prefill+generation t/s
128	128	34.25 ± 0.21	9.79 ± 0.02	15.21 ± 0.03
512	512	36.31 ± 0.15	9.63 ± 0.01	14.70 ± 0.08
1024	1024	35.51 ± 0.08	9.42 ± 0.01	13.79 ± 0.02
2048	2048	33.73 ± 0.04	8.89 ± 0.01	12.29 ± 0.01
4096	4096	30.79 ± 0.06	8.13 ± 0.01	10.21 ± 0.00
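The tables above cover five sizes from 128 to 4096 tokens, while the commands shown run only the 128-token case. llama-bench accepts comma-separated lists for `-p` and `-n` and lets `-pg` be repeated, so one invocation can sweep all sizes. The helper below is a hypothetical sketch that only prints such a command for review; it is an assumption about how the sweep could be run, not a record of how the figures above were produced.

```shell
# bench_cmd is a hypothetical helper: print a llama-bench command that
# sweeps the given sizes for prefill (-p), generation (-n) and the
# combined prompt+generation test (-pg). It does not run anything.
bench_cmd() {
    model="$1"; shift
    sizes="$(printf '%s,' "$@")"
    sizes="${sizes%,}"          # drop the trailing comma
    pg=""
    for s in "$@"; do
        pg="$pg -pg $s,$s"
    done
    echo "./llama-bench -m $model -p $sizes -n $sizes$pg -t 8"
}

bench_cmd ERNIE-4.5-21B-A3B-PT-Q4_0.gguf 128 512 1024 2048 4096
```

To keep the run pinned, prefix the printed command with the same `taskset -c` core list used above.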
