RKLLM DeepSeek-R1
DeepSeek-R1 is a state-of-the-art reasoning model developed by DeepSeek. DeepSeek has open-sourced the training approach and model weights, and the model's performance is competitive with closed-source reasoning models. DeepSeek also used knowledge distillation to release several lightweight open-source variants covering the Qwen2.5 and Llama3.1 families. This document shows how to deploy the distilled DeepSeek-R1-Distill-Qwen-1.5B model to an RK3588 device with the RKLLM toolchain and run hardware-accelerated inference on the built-in NPU.

Quick Start
Download the demo
Download the complete demo from ModelScope.
For virtual environment setup, refer to Virtual Environment Usage.
python3 -m venv .venv && source .venv/bin/activate
pip install -U modelscope
modelscope download --model radxa/DeepSeek-R1-Distill-Qwen-1.5B_RKLLM --local_dir ./DeepSeek-R1-Distill-Qwen-1.5B_RKLLM
Run the Example
cd DeepSeek-R1-Distill-Qwen-1.5B_RKLLM/demo_Linux_aarch64/
export LD_LIBRARY_PATH=./lib
chmod +x ./llm_demo
./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096
Full Conversion Workflow
Set up the development environment by following RKLLM Installation.
Running this example with RKLLM 1.2.3 may cause severe quality degradation (repetitive output). Use RKLLM 1.2.2 for this demo instead. See: GitHub Issue.
Activate the virtual environment
For virtual environment setup, refer to Create Virtual Environment.
conda activate rkllm
pip install -U huggingface_hub
Download the Model
cd RK-SDK/rknn-llm/examples/rkllm_api_demo/
hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./DeepSeek-R1-Distill-Qwen-1.5B
Model Conversion
Generate a quantization calibration file and export the model to the RKLLM format.
If you need a different max_context length, adjust the max_context parameter in the llm.build call in export_rkllm.py.
The default is 4096. Larger values use more memory. The value must be ≤ 16384 and a multiple of 32 (e.g., 32, 64, 96, …, 16384).
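The constraints above can be checked before editing export_rkllm.py. The following is a minimal sketch, not part of the RKLLM toolkit: the function names `validate_max_context` and `round_up_max_context` are hypothetical helpers, and the limits are taken directly from the text above.

```python
MAX_CONTEXT_LIMIT = 16384  # upper bound stated in this document
CONTEXT_STEP = 32          # max_context must be a multiple of this


def validate_max_context(value: int) -> int:
    """Return value unchanged if it is a legal max_context, else raise ValueError."""
    if value <= 0:
        raise ValueError(f"max_context must be positive, got {value}")
    if value > MAX_CONTEXT_LIMIT:
        raise ValueError(f"max_context must be <= {MAX_CONTEXT_LIMIT}, got {value}")
    if value % CONTEXT_STEP != 0:
        raise ValueError(f"max_context must be a multiple of {CONTEXT_STEP}, got {value}")
    return value


def round_up_max_context(requested: int) -> int:
    """Round a requested length up to the next multiple of 32 (hypothetical helper)."""
    rounded = -(-requested // CONTEXT_STEP) * CONTEXT_STEP  # ceiling division
    return validate_max_context(rounded)


print(validate_max_context(4096))   # 4096
print(round_up_max_context(5000))   # 5024
```

The value returned here is what you would then pass as max_context in the llm.build call.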
cd export/
python generate_data_quant.py -m ../DeepSeek-R1-Distill-Qwen-1.5B -o ../DeepSeek-R1-Distill-Qwen-1.5B/data_quant.json
# Before running, update the model path and calibration file path in export_rkllm.py as needed.
python export_rkllm.py
Build the executable
For cross-compiler setup, refer to Compiler Tools.
cd ../deploy/
# Export the cross-compiler path.
export GCC_COMPILER=/path/to/your/gcc/bin/aarch64-linux-gnu
bash build-linux.sh
The generated binaries are located at install/demo_Linux_aarch64.
Deploy to the device
Copy the converted model and the built demo_Linux_aarch64 directory to the device.
cd demo_Linux_aarch64/
export RKLLM_LOG_LEVEL=1
export LD_LIBRARY_PATH=./lib
Run the demo. Type exit to quit.
./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096
$ ./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096
rkllm init start
I rkllm: rkllm-runtime version: 1.2.2, rknpu driver version: 0.9.8, platform: RK3588
...
rkllm init success
user: Solve x+y=14 and 2x+4y=38.
assistant: x=9, y=5
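Because this document recommends RKLLM 1.2.2 over 1.2.3, it can be useful to confirm which version the startup log reports. The sketch below parses the `rkllm-runtime version` line shown in the sample output. The helper names are hypothetical, and it is an assumption that the runtime version in this log line is the one relevant to the reported quality issue; the source does not say whether the issue lies in the runtime or in the conversion toolkit.

```python
import re

# Pattern modeled on the startup log line shown in the sample output.
VERSION_RE = re.compile(r"rkllm-runtime version:\s*(\d+)\.(\d+)\.(\d+)")

# Version this document reports as producing repetitive output.
AFFECTED_VERSION = (1, 2, 3)


def runtime_version(log_text: str):
    """Return the runtime version as a tuple of ints, or None if not found."""
    match = VERSION_RE.search(log_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_affected(log_text: str) -> bool:
    """True if the log reports the version flagged in this document."""
    return runtime_version(log_text) == AFFECTED_VERSION


log_line = "I rkllm: rkllm-runtime version: 1.2.2, rknpu driver version: 0.9.8, platform: RK3588"
print(runtime_version(log_line))  # (1, 2, 2)
print(is_affected(log_line))      # False
```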
| Parameter | Required | Description | Notes |
|---|---|---|---|
| path | Yes | Path to the RKLLM model | N/A |
| max_new_tokens | Yes | Max generated tokens/turn | Must be ≤ max_context_len |
| max_context_len | Yes | Max context length | Must be ≤ export max_context |
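The ordering constraints in the table (max_new_tokens ≤ max_context_len ≤ export max_context) can be checked before launching the demo. This is a minimal sketch under stated assumptions: `build_llm_demo_command` is a hypothetical helper, not part of RKLLM, and `export_max_context` defaults to 4096 because that is the default export value mentioned in this document.

```python
import shlex


def build_llm_demo_command(model_path, max_new_tokens, max_context_len,
                           export_max_context=4096):
    """Validate the positional arguments and return the llm_demo argv list."""
    if max_new_tokens <= 0 or max_context_len <= 0:
        raise ValueError("token limits must be positive")
    if max_new_tokens > max_context_len:
        raise ValueError("max_new_tokens must be <= max_context_len")
    if max_context_len > export_max_context:
        raise ValueError("max_context_len must be <= the max_context used at export")
    # Argument order follows the table: path, max_new_tokens, max_context_len.
    return ["./llm_demo", model_path, str(max_new_tokens), str(max_context_len)]


cmd = build_llm_demo_command(
    "../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm", 2048, 4096
)
print(shlex.join(cmd))
# ./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096
```

The returned list can be passed to `subprocess.run` on the device; the check itself does not touch the model file.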
Performance
For the math prompt "Solve x+y=12 and 2x+4y=34. Find x and y.", RK3588 achieves 15.36 tokens/s:
| Stage | Total Time (ms) | Tokens | Time per Token (ms) | Tokens per Second |
|---|---|---|---|---|
| Prefill | 122.70 | 29 | 4.23 | 236.35 |
| Generate | 27539.16 | 423 | 65.10 | 15.36 |
RK3582 achieves 10.61 tokens/s:
| Stage | Total Time (ms) | Tokens | Time per Token (ms) | Tokens per Second |
|---|---|---|---|---|
| Prefill | 599.71 | 81 | 7.40 | 135.07 |
| Generate | 76866.41 | 851 | 94.25 | 10.61 |
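The derived columns relate to the measured ones by simple arithmetic: time per token is total time divided by token count, and tokens per second is its reciprocal scaled to seconds. The sketch below reproduces the RK3588 rows from their total time and token count; `per_token_stats` is a hypothetical helper, and rounding to two decimals is an assumption made to match the table's formatting.

```python
def per_token_stats(total_ms: float, tokens: int):
    """Return (ms per token, tokens per second), each rounded to 2 decimals."""
    ms_per_token = total_ms / tokens
    tokens_per_second = tokens / (total_ms / 1000.0)
    return round(ms_per_token, 2), round(tokens_per_second, 2)


# RK3588 rows from the table above: (total time in ms, token count)
print(per_token_stats(122.70, 29))     # (4.23, 236.35)
print(per_token_stats(27539.16, 423))  # (65.1, 15.36)
```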