RKLLM DeepSeek-R1

DeepSeek-R1 is a state-of-the-art reasoning model developed by DeepSeek. DeepSeek has open-sourced the training approach and model weights, and the model's performance is competitive with closed-source reasoning models. DeepSeek also released several lightweight open-source variants, produced by knowledge distillation, that cover the Qwen2.5 and Llama3.1 families. This document shows how to deploy the distilled DeepSeek-R1-Distill-Qwen-1.5B model to an RK3588 device with the RKLLM toolchain and run hardware-accelerated inference on the built-in NPU.

Quick Start

Download the demo

Download the complete demo from ModelScope.

For virtual environment setup, refer to Virtual Environment Usage.

Device

python3 -m venv .venv && source .venv/bin/activate
pip install -U modelscope
modelscope download --model radxa/DeepSeek-R1-Distill-Qwen-1.5B_RKLLM --local_dir ./DeepSeek-R1-Distill-Qwen-1.5B_RKLLM

Run the Example

Device

cd DeepSeek-R1-Distill-Qwen-1.5B_RKLLM/demo_Linux_aarch64/
export LD_LIBRARY_PATH=./lib
chmod +x ./llm_demo
./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096

Full Conversion Workflow

Prerequisites

Set up the development environment by following RKLLM Installation.

Version note

Running this example with RKLLM 1.2.3 may cause severe quality degradation (repetitive output). It is recommended to use RKLLM 1.2.2 for this demo. See: GitHub Issue.
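One way to guard against running the affected version by accident is to check the version string that the runtime prints at startup. The sketch below assumes the startup log line has the same shape as the one in this document's sample output (`rkllm-runtime version: 1.2.2, ...`); the helper names and the `KNOWN_BAD_VERSIONS` set are hypothetical and not part of RKLLM.

```python
import re

# Versions this document reports as problematic for this demo (assumption:
# only the version named in the note above is listed here).
KNOWN_BAD_VERSIONS = {"1.2.3"}


def parse_runtime_version(log_line):
    """Return the rkllm-runtime version found in a log line, or None."""
    match = re.search(r"rkllm-runtime version:\s*([0-9]+(?:\.[0-9]+)*)", log_line)
    return match.group(1) if match else None


def is_version_ok(log_line):
    """True when a version was found and it is not in KNOWN_BAD_VERSIONS."""
    version = parse_runtime_version(log_line)
    return version is not None and version not in KNOWN_BAD_VERSIONS


if __name__ == "__main__":
    line = ("I rkllm: rkllm-runtime version: 1.2.2, "
            "rknpu driver version: 0.9.8, platform: RK3588")
    print(parse_runtime_version(line), is_version_ok(line))
```

A line without a version string is treated as "not OK" so that a changed log format fails loudly instead of passing silently.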

Activate the virtual environment

For virtual environment setup, refer to Create Virtual Environment.

X64 Linux PC

conda activate rkllm
pip install -U huggingface_hub

Download the Model

X64 Linux PC

cd RK-SDK/rknn-llm/examples/rkllm_api_demo/
hf download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./DeepSeek-R1-Distill-Qwen-1.5B

Model Conversion

Generate a quantization calibration file and export the model to the RKLLM format.

Tip

To use a different context length, change the max_context argument of the llm.build call in export_rkllm.py. The default is 4096; larger values use more memory. The value must be a multiple of 32 and no greater than 16384 (e.g., 32, 64, 96, …, 16384).
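The constraint on max_context can be checked before editing the export script. This is a minimal sketch that encodes only the two rules stated above (multiple of 32, at most 16384); the function and constant names are made up for illustration and are not part of the RKLLM toolkit.

```python
MAX_CONTEXT_LIMIT = 16384  # upper bound stated in the tip
MAX_CONTEXT_STEP = 32      # the value must be a multiple of this


def is_valid_max_context(value):
    """True if value is a positive multiple of 32 that does not exceed 16384."""
    return 0 < value <= MAX_CONTEXT_LIMIT and value % MAX_CONTEXT_STEP == 0


def round_up_max_context(value):
    """Round value up to the next valid max_context, or raise ValueError."""
    if value <= 0:
        raise ValueError("max_context must be positive")
    # Ceiling division, then scale back to a multiple of the step.
    rounded = -(-value // MAX_CONTEXT_STEP) * MAX_CONTEXT_STEP
    if rounded > MAX_CONTEXT_LIMIT:
        raise ValueError("max_context must be <= %d" % MAX_CONTEXT_LIMIT)
    return rounded


if __name__ == "__main__":
    for candidate in (4096, 5000, 16384):
        print(candidate, is_valid_max_context(candidate),
              round_up_max_context(candidate))
```

Rounding up rather than down keeps the requested context fully covered, at the cost of slightly more memory.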

X64 Linux PC

cd export/
python generate_data_quant.py -m ../DeepSeek-R1-Distill-Qwen-1.5B -o ../DeepSeek-R1-Distill-Qwen-1.5B/data_quant.json
# Before running, update the model path and calibration file path in export_rkllm.py as needed.
python export_rkllm.py

Build the executable

For cross-compiler setup, refer to Compiler Tools.

X64 Linux PC

cd ../deploy/
# Export the cross-compiler path.
export GCC_COMPILER=/path/to/your/gcc/bin/aarch64-linux-gnu
bash build-linux.sh

The generated binaries are placed in install/demo_Linux_aarch64.

Deploy to the device

Copy the converted model and the built demo_Linux_aarch64 directory to the device.

Device

cd demo_Linux_aarch64/
export RKLLM_LOG_LEVEL=1
export LD_LIBRARY_PATH=./lib
./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096

Run the demo. Type exit to quit.

Device

./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096

$ ./llm_demo ../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096
rkllm init start
I rkllm: rkllm-runtime version: 1.2.2, rknpu driver version: 0.9.8, platform: RK3588
...
rkllm init success

user: Solve x+y=14 and 2x+4y=38.
assistant: x=9, y=5

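llm_demo takes three positional arguments, in this order:

Parameter	Required	Description	Notes
`path`	Yes	Path to the RKLLM model file	N/A
`max_new_tokens`	Yes	Maximum number of tokens generated per turn	Must be ≤ `max_context_len`
`max_context_len`	Yes	Maximum context length, in tokens	Must be ≤ the `max_context` value used at export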
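The relationships in the Notes column can be checked before launching the demo. The following sketch only validates the numbers and assembles the argument list in the order used by the commands in this document; `build_llm_demo_args` and its `export_max_context` parameter are hypothetical names, and the default of 4096 is the export default mentioned earlier.

```python
def build_llm_demo_args(model_path, max_new_tokens, max_context_len,
                        export_max_context=4096):
    """Validate the runtime limits and return the llm_demo argument list."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    if max_new_tokens > max_context_len:
        raise ValueError("max_new_tokens must be <= max_context_len")
    if max_context_len > export_max_context:
        raise ValueError("max_context_len must be <= the export max_context")
    # Positional order: model path, max_new_tokens, max_context_len.
    return ["./llm_demo", model_path, str(max_new_tokens), str(max_context_len)]


if __name__ == "__main__":
    args = build_llm_demo_args(
        "../DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm", 2048, 4096)
    print(" ".join(args))
```

The returned list can be passed to `subprocess.run` on the device, or simply printed and copied into a shell.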

Performance

For the math prompt "Solve x+y=12 and 2x+4y=34. Find x and y.", the RK3588 generates 15.36 tokens/s:

Stage	Total Time (ms)	Tokens	Time per Token (ms)	Tokens per Second
Prefill	122.70	29	4.23	236.35
Generate	27539.16	423	65.10	15.36

RK3582 achieves 10.61 tokens/s:

Stage	Total Time (ms)	Tokens	Time per Token (ms)	Tokens per Second
Prefill	599.71	81	7.40	135.07
Generate	76866.41	851	94.25	10.61
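The last two columns of these tables can be derived from the first two: time per token is the total time divided by the token count, and tokens per second is the token count divided by the total time in seconds. A small sketch of that arithmetic, applied to the RK3588 rows (the function name `throughput` is made up for illustration):

```python
def throughput(total_ms, tokens):
    """Return (ms per token, tokens per second), each rounded to 2 decimals."""
    if tokens <= 0 or total_ms <= 0:
        raise ValueError("total_ms and tokens must be positive")
    ms_per_token = total_ms / tokens
    tokens_per_second = tokens / (total_ms / 1000.0)
    return round(ms_per_token, 2), round(tokens_per_second, 2)


if __name__ == "__main__":
    # RK3588 rows from the table above: (stage, total time in ms, tokens).
    rows = [("Prefill", 122.70, 29), ("Generate", 27539.16, 423)]
    for stage, total_ms, tokens in rows:
        ms_per_token, tps = throughput(total_ms, tokens)
        print("%s: %.2f ms/token, %.2f tokens/s" % (stage, ms_per_token, tps))
```

The same function can be used to sanity-check numbers copied from the runtime's performance log on other boards.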
