RKLLM DeepSeek-R1

DeepSeek-R1 is developed by DeepSeek, a Hangzhou-based company focused on AI research. The model's training techniques and weights are fully open-sourced, and its performance is comparable to the closed-source OpenAI o1. DeepSeek has also distilled six smaller models from DeepSeek-R1 for the open-source community, based on Qwen2.5 and Llama3.1.

This document guides you through deploying the DeepSeek-R1 distilled model DeepSeek-R1-Distill-Qwen-1.5B on the RK3588 platform with RKLLM, using the NPU for hardware-accelerated inference.

(Figure: rkllm_2.webp)

Model File Download

tip

Radxa provides precompiled rkllm models and executables that can be downloaded and used directly. If you want to go through the compilation process yourself, continue with the optional sections below.

  • Use git LFS to download the precompiled rkllm model from ModelScope

    X86 Linux PC
    git lfs install
    git clone https://www.modelscope.cn/radxa/DeepSeek-R1-Distill-Qwen-1.5B_RKLLM.git

(Optional) Model Compilation

tip

Please complete the RKLLM setup on both PC and development board according to RKLLM Installation

tip

For RK358X users, please specify the rk3588 platform as TARGET_PLATFORM

  • On your x86 PC workstation, download the DeepSeek-R1-Distill-Qwen-1.5B model weights. If you haven't installed git-lfs, please install it first.

    X86 Linux PC
    git lfs install
    git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  • Activate the rkllm conda environment. You can refer to RKLLM conda Installation

    X86 Linux PC
    conda activate rkllm
  • Generate the quantization calibration file for the LLM model

    X86 Linux PC
    cd examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export
    python3 generate_data_quant.py -m /path/to/DeepSeek-R1-Distill-Qwen-1.5B
    | Parameter | Required | Description                           | Options |
    | --------- | -------- | ------------------------------------- | ------- |
    | path      | Required | Path to the Hugging Face model folder | N/A     |

    generate_data_quant.py will generate a quantization calibration file named data_quant.json; a sketch of its rough shape follows below.
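
    The calibration file is a JSON list of sample prompts drawn from the model's expected workload. Below is a minimal sketch of building such a file by hand; the "input"/"target" field names are assumptions taken from the rknn-llm example scripts, so verify them against generate_data_quant.py before use.

    Python Code
    # Minimal sketch of a hand-built calibration file; the field names
    # ("input"/"target") are assumptions taken from the rknn-llm examples.
    import json

    samples = [
        {"input": "Solve the equations x + y = 12, 2x + 4y = 34.", "target": ""},
        {"input": "Explain the Pythagorean theorem in one sentence.", "target": ""},
    ]

    with open("data_quant.json", "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)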

  • Modify the model path in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/export_rkllm.py

    Python Code
    11 modelpath = '/path/to/DeepSeek-R1-Distill-Qwen-1.5B'
  • Adjust the maximum context length (max_context)

    If you have specific requirements for max_context, modify the value of the max_context parameter in the llm.build function within rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/export_rkllm.py. The default is 4096. A higher value increases memory usage. Do not exceed 16384, and ensure the value is a multiple of 32 (e.g., 32, 64, 96, ..., 16384); a small validation helper is sketched below.
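
    Since the multiple-of-32 constraint is easy to get wrong, here is a minimal helper (hypothetical, not part of the RKLLM toolkit) that clamps a requested context length to a valid value:

    Python Code
    def valid_max_context(requested: int, limit: int = 16384, step: int = 32) -> int:
        """Round `requested` down to the nearest multiple of 32, capped at 16384."""
        clamped = max(step, min(requested, limit))
        return clamped - (clamped % step)

    assert valid_max_context(4096) == 4096    # the default is already valid
    assert valid_max_context(5000) == 4992    # rounded down to a multiple of 32
    assert valid_max_context(99999) == 16384  # capped at the upper bound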

  • Run the model conversion script

    X86 Linux PC
    python3 export_rkllm.py

    After successful conversion, you will obtain the rkllm model file ./DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm. As the naming convention indicates, this model is W8A8-quantized and targets the RK3588 platform. A condensed sketch of what the export script does internally follows below.
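
    The conversion boils down to a few rkllm-toolkit calls: load the Hugging Face weights, build a quantized model against the calibration data, and export the .rkllm artifact. The sketch below assumes the rkllm.api interface used by the rknn-llm examples; verify the exact parameter names against export_rkllm.py.

    Python Code
    # Condensed sketch of the conversion flow; parameter names follow the
    # rknn-llm example scripts and should be checked against export_rkllm.py.
    from rkllm.api import RKLLM

    modelpath = '/path/to/DeepSeek-R1-Distill-Qwen-1.5B'
    llm = RKLLM()

    # Load the Hugging Face model weights
    ret = llm.load_huggingface(model=modelpath)
    assert ret == 0, 'load failed'

    # W8A8 quantization for the RK3588 NPU, using the calibration data
    ret = llm.build(do_quantization=True,
                    quantized_dtype='w8a8',
                    target_platform='rk3588',
                    dataset='./data_quant.json',
                    max_context=4096)
    assert ret == 0, 'build failed'

    # Write the deployable .rkllm model file
    ret = llm.export_rkllm('./DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm')
    assert ret == 0, 'export failed'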

(Optional) Build Executable

  • Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu

  • Modify the main program code in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/src/llm_demo.cpp

    Comment out line 165. When converting the model, RKLLM automatically parses the chat_template field in the tokenizer_config.json file of the Hugging Face model, so no manual changes are required.

    CPP Code
    165 // rkllm_set_chat_template(llmHandle, "", "<|User|>", "<|Assistant|>");
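
    To see what RKLLM will pick up during conversion, you can inspect the chat_template field yourself; a minimal check in Python:

    Python Code
    # Print the chat template that RKLLM parses from the Hugging Face model.
    import json

    with open('/path/to/DeepSeek-R1-Distill-Qwen-1.5B/tokenizer_config.json') as f:
        cfg = json.load(f)

    print(cfg.get('chat_template', 'no chat_template field found'))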
  • Update the GCC_COMPILER_PATH in the build script rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/build-linux.sh

    BASH
    8 GCC_COMPILER_PATH=/path/to/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  • Run the build script

    X86 Linux PC
    cd rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/
    bash build-linux.sh

    The generated executable will be located at install/demo_Linux_aarch64.

Board-side Deployment

Terminal Mode

  • Copy the converted DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm model and the compiled demo_Linux_aarch64 folder to the development board.

  • Set the environment variable

    Radxa OS
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/demo_Linux_aarch64/lib
    tip

    Users who downloaded the model from ModelScope can point LD_LIBRARY_PATH at the librkllmrt.so bundled in the downloaded repository.

  • Run the llm_demo. Type exit to quit.

    Radxa OS
    export RKLLM_LOG_LEVEL=1
    ## Usage: ./llm_demo model_path max_new_tokens max_context_len
    ./llm_demo /path/to/DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm 2048 4096
    | Parameter       | Required | Description                          | Options                           |
    | --------------- | -------- | ------------------------------------ | --------------------------------- |
    | path            | Required | Path to the RKLLM model file         | N/A                               |
    | max_new_tokens  | Required | Maximum number of tokens to generate | Must be ≤ max_context_len         |
    | max_context_len | Required | Maximum context size for the model   | Must be ≤ the model's max_context |
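
    If you prefer to launch the demo from a script rather than typing the command by hand, a simple wrapper can set the environment and arguments in one place. This is a hypothetical convenience sketch; adjust the paths to wherever you copied the model and the demo_Linux_aarch64 folder.

    Python Code
    # Hypothetical wrapper around the interactive demo binary.
    import os
    import subprocess

    env = dict(os.environ,
               LD_LIBRARY_PATH='/path/to/demo_Linux_aarch64/lib',
               RKLLM_LOG_LEVEL='1')

    subprocess.run(['./llm_demo',
                    '/path/to/DeepSeek-R1-Distill-Qwen-1.5B_W8A8_RK3588.rkllm',
                    '2048',   # max_new_tokens
                    '4096'],  # max_context_len
                   cwd='/path/to/demo_Linux_aarch64', env=env, check=True)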

    (Figure: rkllm_2.webp)

Performance Analysis

For the math problem "Solve the equations x + y = 12 and 2x + 4y = 34 for x and y", performance on RK3588 reaches 15.36 tokens/s:

| Stage    | Total Time (ms) | Tokens | Time per Token (ms) | Tokens per Second |
| -------- | --------------- | ------ | ------------------- | ----------------- |
| Prefill  | 122.70          | 29     | 4.23                | 236.35            |
| Generate | 27539.16        | 423    | 65.10               | 15.36             |

On RK3582, the performance reaches 10.61 tokens/s:

| Stage    | Total Time (ms) | Tokens | Time per Token (ms) | Tokens per Second |
| -------- | --------------- | ------ | ------------------- | ----------------- |
| Prefill  | 599.71          | 81     | 7.4                 | 135.07            |
| Generate | 76866.41        | 851    | 94.25               | 10.61             |
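
The columns of these tables are related by simple arithmetic: time per token is total time divided by token count, and tokens per second is 1000 divided by the time per token in milliseconds. A quick check against the RK3588 numbers:

Python Code
# Recompute the derived columns of the RK3588 table from total time and tokens.
for stage, total_ms, tokens in [("Prefill", 122.70, 29), ("Generate", 27539.16, 423)]:
    per_token_ms = total_ms / tokens
    tokens_per_s = 1000 / per_token_ms
    print(f"{stage}: {per_token_ms:.2f} ms/token, {tokens_per_s:.2f} tokens/s")
# Prefill: 4.23 ms/token, 236.35 tokens/s
# Generate: 65.10 ms/token, 15.36 tokens/s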