RKLLM Usage and LLM Deployment

This document describes how to deploy Huggingface-format large language models onto RK3588 using RKLLM for hardware-accelerated inference on the NPU.

Currently Supported Models

We will use Qwen2.5-1.5B-Instruct as an example and follow the sample scripts provided in the RKLLM repository to fully demonstrate how to deploy a large language model from scratch onto a development board equipped with the RK3588 chip, utilizing the NPU for hardware-accelerated inference.

tip

If you have not installed or configured the RKLLM environment yet, please refer to RKLLM Installation.

Model Conversion

tip

For RK358X users, please specify rk3588 as the TARGET_PLATFORM.

We will use Qwen2.5-1.5B-Instruct as an example, but you may choose any model from the list of currently supported models.

  • Download the weights of Qwen2.5-1.5B-Instruct on your x86 PC workstation. If you haven't installed git-lfs, please install it first.

    X86 Linux PC
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
  • Activate the rkllm conda environment. You can refer to RKLLM Conda Installation.

    X86 Linux PC
    conda activate rkllm
  • Generate the LLM model quantization calibration file.

    tip

    For LLM models, we use the conversion script provided in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export.

    For VLM models, use the conversion script in rknn-llm/examples/Qwen2-VL_Demo/export. For multimodal VLM models, please refer to RKLLM Qwen2-VL.

    X86 Linux PC
    cd rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export
    python3 generate_data_quant.py -m /path/to/Qwen2.5-1.5B-Instruct
    Parameter | Required | Description | Options
    path | Required | Path to the Huggingface model folder. | N

    The generate_data_quant.py script generates the calibration file data_quant.json, which is used during model quantization (the complete conversion flow is recapped at the end of these steps).

  • Update the modelpath variable in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/export_rkllm.py to point to your model path.

    Python Code
    modelpath = '/path/to/Qwen2.5-1.5B-Instruct'
  • Adjust the maximum context length max_context

    If you need a specific max_context length, modify the value of the max_context parameter in the llm.build function within rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/export_rkllm.py. The default is 4096; larger values consume more memory. It must not exceed 16,384 and must be a multiple of 32 (e.g., 32, 64, 96, ..., 16,384).

  • Run the model conversion script.

    X86 Linux PC
    python3 export_rkllm.py

    After successful conversion, you will get an .rkllm model file — in this case, Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm. From the filename, you can see that this model has been quantized using W8A8 and is compatible with the RK3588 platform.
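
For reference, the complete conversion flow is summarized below. This is only a recap of the steps above; the paths are placeholders, and export_rkllm.py must already contain your modelpath (and, if changed, max_context) values.

    X86 Linux PC
    # Recap of the conversion flow (paths are placeholders).
    conda activate rkllm
    cd rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export
    # Generate the quantization calibration file data_quant.json from the downloaded model.
    python3 generate_data_quant.py -m /path/to/Qwen2.5-1.5B-Instruct
    # Export the quantized model; export_rkllm.py reads the modelpath and max_context set earlier.
    python3 export_rkllm.py
    # The output is a W8A8 model targeting RK3588.
    ls -lh Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm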

Compiling the Executable

  • Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu (a download and extraction sketch follows at the end of these steps).

  • Modify the main program code in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/src/llm_demo.cpp

    You need to comment out line 165: RKLLM automatically parses the chat_template field from tokenizer_config.json during model conversion, so there is no need to set it manually.

    CPP Code
    165 // rkllm_set_chat_template(llmHandle, "", "<|User|>", "<|Assistant|>");
  • Update the GCC_COMPILER_PATH in the build script rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/build-linux.sh

    BASH
    GCC_COMPILER_PATH=/path/to/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  • Run the build script.

    X86 Linux PC
    cd rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/
    bash build-linux.sh

    The compiled executable will be located in install/demo_Linux_aarch64.
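
If the cross-compilation toolchain from the first bullet above is not already installed on your PC, the sketch below downloads and unpacks it to /opt. The download URL and the /opt install location are assumptions (Arm's download layout may change), so verify the link and point GCC_COMPILER_PATH in build-linux.sh to wherever you actually extract it.

    X86 Linux PC
    # Hypothetical example: fetch and unpack the aarch64 cross-compiler.
    # Verify the URL on Arm's developer site before use; it may have moved.
    wget https://developer.arm.com/-/media/Files/downloads/gnu-a/10.2-2020.11/binrel/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.tar.xz
    sudo tar -xJf gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu.tar.xz -C /opt
    # GCC_COMPILER_PATH in build-linux.sh would then be:
    # /opt/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu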

Deployment on Device

Local Terminal Mode

  • Copy the converted .rkllm model and the compiled demo_Linux_aarch64 folder to the device (an example scp command is shown after these steps).

  • Set up environment variables

    Radxa OS
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/demo_Linux_aarch64/lib
  • Run llm_demo; type exit to quit

    Radxa OS
    export RKLLM_LOG_LEVEL=1
    ## Usage: ./llm_demo model_path max_new_tokens max_context_len
    ./llm_demo /path/to/Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm 2048 4096
    Parameter | Required | Description | Options
    path | Required | Path to the RKLLM model file. | N
    max_new_tokens | Required | Maximum number of tokens to generate per round. | Must be less than or equal to max_context_len
    max_context_len | Required | Maximum context size for the model. | Must be less than or equal to the max_context used during model conversion

    (Screenshot: rkllm_2.webp)
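
As an example of the copy step in the first bullet above, the model and the demo folder can be pushed to the board with scp. The username, IP address, and destination directory are placeholders; substitute your own.

    X86 Linux PC
    # Hypothetical example: copy the model and the compiled demo to the board over SSH.
    scp Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm radxa@192.168.1.100:/home/radxa/
    scp -r install/demo_Linux_aarch64 radxa@192.168.1.100:/home/radxa/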

Performance Comparison for Selected Models

Model | Parameter Size | Chip | Chip Count | Inference Speed
TinyLlama | 1.1B | RK3588 | 1 | 15.03 token/s
Qwen | 1.8B | RK3588 | 1 | 14.18 token/s
Phi3 | 3.8B | RK3588 | 1 | 6.46 token/s
ChatGLM3 | 6B | RK3588 | 1 | 3.67 token/s
Qwen2.5 | 1.5B | RK3588 | 1 | 15.44 token/s