ERNIE 4.5-21B-A3B
This document explains how to run the following Baidu ERNIE models on the Radxa Orion O6 / O6N using llama.cpp with KleidiAI acceleration:
- ERNIE-4.5-21B-A3B
- ERNIE-4.5-21B-A3B-Base
Download the model
Radxa provides pre-built GGUF files:
ERNIE-4.5-21B-A3B-PT-Q4_0.gguf and
ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf.
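You can download them with modelscope. Note that --local_dir names a directory, so with the commands below each file is saved inside a directory of the same name (for example ./ERNIE-4.5-21B-A3B-PT-Q4_0.gguf/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf). Adjust --local_dir, or the -m path used later, to match.
Install modelscope once, then download the file you need:
pip3 install modelscope
- ERNIE-4.5-21B-A3B-PT
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-21B-A3B-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-21B-A3B-PT-Q4_0.gguf
- ERNIE-4.5-21B-A3B-Base-PT
modelscope download --model radxa/ERNIE-4.5-GGUF ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf --local_dir ./ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf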
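Before moving on, it can help to confirm that the download is a GGUF file and not an error page or a partial file. The sketch below checks only the 4-byte magic that GGUF files start with. The helper name is_gguf and the GGUF_PATH variable are illustrative assumptions, not part of modelscope or llama.cpp.

```shell
# Sketch: check that a file starts with the 4-byte GGUF magic ("GGUF").
# is_gguf is a hypothetical helper name, not a modelscope or llama.cpp command.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Example usage (GGUF_PATH is a placeholder; set it to wherever the file landed):
# GGUF_PATH=./path/to/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf
# is_gguf "$GGUF_PATH" && echo "looks like a GGUF file"
```

This does not validate the rest of the file; a truncated download would still pass, so comparing the file size against the model page is a reasonable second check.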
Convert the model (optional)
If you want to convert the model to GGUF yourself, follow this section on an x86 host.
Otherwise, download the pre-built GGUF from Radxa and skip to Inference.
Build llama.cpp
Follow the llama.cpp build instructions to build llama.cpp on an x86 host.
Build command:
sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build
cmake --build build --config Release
Download the source model
Use modelscope to download the original model:
pip3 install modelscope
- ERNIE-4.5-21B-A3B-PT
modelscope download --model PaddlePaddle/ERNIE-4.5-21B-A3B-PT --local_dir ./ERNIE-4.5-21B-A3B-PT
- ERNIE-4.5-21B-A3B-Base-PT
modelscope download --model PaddlePaddle/ERNIE-4.5-21B-A3B-Base-PT --local_dir ./ERNIE-4.5-21B-A3B-Base-PT
Convert to a float (F16) GGUF
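The conversion script needs the Python dependencies listed in llama.cpp's requirements.txt (pip3 install -r requirements.txt). The commands assume the source model directory sits inside the llama.cpp directory.
- ERNIE-4.5-21B-A3B-PT
cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-21B-A3B-PT
- ERNIE-4.5-21B-A3B-Base-PT
cd llama.cpp
python3 convert_hf_to_gguf.py ./ERNIE-4.5-21B-A3B-Base-PT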
Running convert_hf_to_gguf.py generates an F16 (float) GGUF file in the model directory.
Quantize the GGUF
Use llama-quantize to quantize the float GGUF to Q4_0:
- ERNIE-4.5-21B-A3B-PT
cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-21B-A3B-PT/ERNIE-4.5-21B-A3B-PT-F16.gguf ERNIE-4.5-21B-A3B-PT/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf Q4_0
- ERNIE-4.5-21B-A3B-Base-PT
cd llama.cpp
./build/bin/llama-quantize ERNIE-4.5-21B-A3B-Base-PT/ERNIE-4.5-21B-A3B-Base-PT-F16.gguf ERNIE-4.5-21B-A3B-Base-PT/ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf Q4_0
Running llama-quantize generates a GGUF file with the selected quantization format in the target path.
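As a rough sanity check on the quantized file, the expected size can be estimated from the parameter count. Q4_0 stores weights in blocks of 32, each block taking 18 bytes (16 bytes of 4-bit values plus a 2-byte scale), which is 4.5 bits per weight. The sketch below applies that ratio to every parameter, which is a simplifying assumption; the helper name estimate_q4_0_gib is made up for illustration.

```shell
# Sketch: estimate the GGUF size in GiB if every weight were stored as Q4_0.
# Q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit values + a 2-byte scale).
# estimate_q4_0_gib is a hypothetical helper; its argument is the parameter
# count in billions.
estimate_q4_0_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b * 1e9 * 18 / 32 / (1024 * 1024 * 1024) }'
}

# 21.83 B parameters, the figure reported in the benchmark section
estimate_q4_0_gib 21.83   # prints 11.44
```

The benchmark section reports 11.51 GiB for this model, slightly above the estimate. The difference would be consistent with some tensors being kept at a higher precision, though this document does not confirm that.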
Inference
Build llama.cpp
Follow the llama.cpp build instructions to build llama.cpp with KleidiAI enabled on the Radxa Orion O6 / O6N.
Build command:
sudo apt install cmake gcc g++
git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
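Because the build disables GGML_NATIVE and names the i8mm and dotprod extensions explicitly, it may be worth confirming that the kernel reports both before running the binary. The sketch below reads the Features line of /proc/cpuinfo, where Linux on Arm lists dot product as asimddp and the int8 matrix multiply extension as i8mm. The helper name has_cpu_feature is an assumption for illustration.

```shell
# has_cpu_feature is a hypothetical helper: it checks whether a space-separated
# feature list (as printed on the "Features" line of /proc/cpuinfo) contains
# an exact flag name.
has_cpu_feature() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the first Features line; this stays empty on hosts without that line.
features=$(grep -m1 '^Features' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
for flag in asimddp i8mm; do
    if has_cpu_feature "$features" "$flag"; then
        echo "$flag: present"
    else
        echo "$flag: missing"
    fi
done
```

If either flag is reported missing on the target board, the architecture string passed to GGML_CPU_ARM_ARCH probably needs adjusting for that hardware.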
Run inference
Use llama-cli to chat with the model. The command and a sample session are shown first for ERNIE-4.5-21B-A3B-PT, then for ERNIE-4.5-21B-A3B-Base-PT.
- ERNIE-4.5-21B-A3B-PT
cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-21B-A3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-21B-A3B-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7406-4aced7a63
model : ERNIE-4.5-21B-A3B-PT-Q4_0.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> What is relativity?
**Relativity** is a foundational theory in physics developed by Albert Einstein, primarily consisting of two parts: **special relativity** (1905) and **general relativity** (1915). It revolutionized our understanding of space, time, and gravity, challenging classical Newtonian physics.
### **1. Special Relativity**
- **Key Idea**: Physics laws are the same for all non-accelerating observers, regardless of their motion.
- **Postulates**:
1. **Principle of Relativity**: Physical laws are identical in all inertial frames.
2. **Speed of Light**: The speed of light in a vacuum (*c* ≈ 299,792 km/s) is constant and does not depend on the motion of the light source or observer.
- **Consequences**:
- **Time Dilation**: Time slows down for objects moving at relativistic speeds (close to *c*). For example, a clock on a fast-moving train ticks slower than one on Earth.
- **Length Contraction**: Objects appear shorter along the direction of motion when moving at high speeds.
- **Mass-Energy Equivalence**: *E = mc²*—energy (*E*) and mass (*m*) are interchangeable, explaining nuclear reactions.
- **Relativistic Momentum**: Momentum depends on velocity, not just speed.
### **2. General Relativity**
- **Key Idea**: Gravity is not a force but the curvature of spacetime caused by mass and energy.
- **Postulates**:
- **Equivalence Principle**: A local inertial frame (free-falling) is indistinguishable from one without gravity.
- **Spacetime Curvature**: Massive objects like planets warp spacetime, causing objects to follow curved paths (e.g., orbits).
- **Consequences**:
- **Gravitational Time Dilation**: Clocks run slower in stronger gravitational fields (e.g., near Earth’s surface vs. orbit).
- **Light Bending**: Light curves around massive objects due to spacetime curvature (confirmed by Eddington’s 1919 eclipse experiment).
- **Black Holes**: Extreme curvature traps light and matter, creating regions where nothing escapes.
- **Expanding Universe**: General relativity explains the universe’s expansion, leading to the Big Bang theory.
### **Applications and Impact**
- **Technology**: GPS systems rely on corrections for both special relativity (time dilation) and general relativity (gravity’s effect on time).
- **Cosmology**: Predicts black holes, neutron stars, and the universe’s evolution.
- **Fundamental Physics**: Unifies with quantum mechanics in attempts to explain the universe’s origin (e.g., string theory, loop quantum gravity).
### **Why It Matters**
Relativity reshaped modern physics by showing that space, time, and gravity are interconnected. It replaced Newton’s absolute space and time with a dynamic, relative framework, providing a more accurate description of the cosmos at both microscopic and cosmic scales.
In short, relativity is the science of how space, time, and energy influence each other, reshaping our understanding of reality.
[ Prompt: 18.6 t/s | Generation: 7.0 t/s ]
- ERNIE-4.5-21B-A3B-Base-PT
cd llama.cpp
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
(base) rock@orion-o6:~/baidu/llama.cpp/build/bin$ taskset -c 0,5,6,7,8,9,10,11 ./llama-cli -m ../../../gguf/ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf -c 4096 -t 8 --conversation --jinja
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7406-4aced7a63
model : ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> What is relativity?
Relativity is a term used in physics to describe the theory of relativity, which was developed by Albert Einstein in the early 20th century. The theory of relativity is based on the idea that the laws of physics are the same for all observers, regardless of their relative motion. This means that the laws of physics, such as the laws of motion and the laws of gravity, apply equally to all observers, whether they are stationary or moving at a constant velocity.
The theory of relativity has two main branches: special relativity and general relativity. Special relativity deals with the behavior of objects moving at constant velocities, while general relativity deals with the behavior of objects in the presence of gravity.
The theory of relativity has had a profound impact on our understanding of the universe, including the discovery of black holes, the expansion of the universe, and the existence of gravitational waves. It has also led to the development of new technologies, such as GPS, which rely on the principles of relativity to function accurately.
[ Prompt: 18.5 t/s | Generation: 7.6 t/s ]
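The commands above pin llama-cli to eight CPUs with taskset and pass -t 8, so the thread count equals the number of pinned CPUs. If you change the CPU list, a small helper can keep the two in step. The sketch below counts the CPUs in a taskset-style list; count_cpus is a hypothetical helper name, and the choice of CPUs is simply copied from the commands in this document.

```shell
# count_cpus is a hypothetical helper: it counts the CPUs named in a
# taskset-style list such as "0,5,6,7" or "0-3,8".
count_cpus() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 2 { total += $2 - $1 + 1; next }   # a range like 4-7
        NF == 1 { total += 1 }                   # a single CPU id
        END { print total + 0 }'
}

cores="0,5,6,7,8,9,10,11"
threads=$(count_cpus "$cores")
echo "pin to CPUs $cores and pass -t $threads"
```

The resulting values can then be substituted into the taskset -c and -t arguments of the llama-cli command.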
Performance benchmarking
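You can use llama-bench to benchmark the model. Run it from the directory that contains the binary (build/bin in the sessions above). The commands shown run the 128-token case; the remaining table rows presumably come from repeating the run with -p, -n and -pg set to the listed sizes. Commands and results are listed first for ERNIE-4.5-21B-A3B-PT, then for ERNIE-4.5-21B-A3B-Base-PT.
- ERNIE-4.5-21B-A3B-PT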
taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-21B-A3B-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8
| Model | ernie4_5-moe 21B.A3B Q4_0 |
|---|---|
| Size | 11.51 GiB |
| params | 21.83 B |
| backend | CPU |
| threads | 8 |
| n-prompt | n-gen | prefill t/s | generation t/s | prefill+generation t/s |
|---|---|---|---|---|
| 128 | 128 | 33.96 ± 0.35 | 9.75 ± 0.01 | 15.15 ± 0.04 |
| 512 | 512 | 36.30 ± 0.11 | 9.67 ± 0.02 | 14.69 ± 0.01 |
| 1024 | 1024 | 35.25 ± 0.04 | 9.38 ± 0.01 | 13.76 ± 0.01 |
| 2048 | 2048 | 33.59 ± 0.06 | 8.89 ± 0.01 | 12.28 ± 0.01 |
| 4096 | 4096 | 30.79 ± 0.02 | 8.15 ± 0.02 | 10.21 ± 0.02 |
- ERNIE-4.5-21B-A3B-Base-PT
taskset -c 0,5,6,7,8,9,10,11 ./llama-bench -m ERNIE-4.5-21B-A3B-Base-PT-Q4_0.gguf -p 128 -n 128 -pg 128,128 -t 8
| Model | ernie4_5-moe 21B.A3B Base Q4_0 |
|---|---|
| Size | 11.51 GiB |
| params | 21.83 B |
| backend | CPU |
| threads | 8 |
| n-prompt | n-gen | prefill t/s | generation t/s | prefill+generation t/s |
|---|---|---|---|---|
| 128 | 128 | 34.25 ± 0.21 | 9.79 ± 0.02 | 15.21 ± 0.03 |
| 512 | 512 | 36.31 ± 0.15 | 9.63 ± 0.01 | 14.70 ± 0.08 |
| 1024 | 1024 | 35.51 ± 0.08 | 9.42 ± 0.01 | 13.79 ± 0.02 |
| 2048 | 2048 | 33.73 ± 0.04 | 8.89 ± 0.01 | 12.29 ± 0.01 |
| 4096 | 4096 | 30.79 ± 0.06 | 8.13 ± 0.01 | 10.21 ± 0.00 |
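To read the prefill+generation column, it may help to treat it as total tokens divided by total time. The sketch below computes that figure from the separate prefill and generation rates. This relation is an assumption about how the columns fit together, not something llama-bench documents here, and combined_tps is a hypothetical helper name.

```shell
# combined_tps is a hypothetical helper: given n_prompt, n_gen, prefill t/s and
# generation t/s, it returns total tokens divided by total time.
combined_tps() {
    awk -v p="$1" -v n="$2" -v pp="$3" -v tg="$4" \
        'BEGIN { printf "%.2f\n", (p + n) / (p / pp + n / tg) }'
}

# First row of the ERNIE-4.5-21B-A3B-PT table: 128 prompt + 128 generated tokens
combined_tps 128 128 33.96 9.75   # prints 15.15
```

For the 128-token row this reproduces the measured 15.15 t/s. For longer rows the measured value falls below this estimate (the 512 row gives about 15.3 by this formula against a measured 14.69), presumably because the standalone generation rate is measured from an empty context while the combined test generates after a long prompt.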