
RKLLM Usage

This document explains how to use RKLLM to deploy Hugging Face-format LLMs to RK3588 and run hardware-accelerated inference on the NPU.

Supported models

This guide uses Qwen2.5-1.5B-Instruct as an example and follows the demo scripts in the RKLLM repository to walk through an end-to-end deployment on an RK3588 device with NPU acceleration.

tip

If you haven't installed and configured RKLLM yet, follow RKLLM Installation.

Model Conversion

tip

For RK358x, set TARGET_PLATFORM to rk3588.

This section uses Qwen2.5-1.5B-Instruct as an example. You can also pick any model from the Supported models list.

  • On an x86 Linux PC, download the model weights (install git-lfs if needed):

    X86 Linux PC
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
  • Activate the rkllm conda environment (see RKLLM conda installation):

    X86 Linux PC
    conda activate rkllm
  • Generate the quantization calibration file for the LLM

    tip

    For LLMs, this guide uses the conversion scripts under rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export.

    For VLMs, use the scripts under rknn-llm/examples/Qwen2-VL_Demo/export instead; see RKLLM Qwen2-VL for the multimodal workflow.

    X86 Linux PC
    cd examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export
    python3 generate_data_quant.py -m /path/to/Qwen2.5-1.5B-Instruct
    Parameter  Required  Description                   Notes
    path       Yes       Hugging Face model directory  N/A

    generate_data_quant.py generates data_quant.json, which is used during quantization.

  • Update modelpath in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/export_rkllm.py:

    Python Code
    modelpath = '/path/to/Qwen2.5-1.5B-Instruct'
  • Adjust max_context (optional)

    If you need a different context length, modify max_context in the llm.build call in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/export/export_rkllm.py. The default is 4096, and larger values consume more memory. The value must be ≤ 16384 and a multiple of 32 (e.g., 32, 64, 96, …, 16384).

  • Run the conversion script

    X86 Linux PC
    python3 export_rkllm.py

    After a successful conversion, you should get an RKLLM model such as Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm. The name indicates this model is W8A8-quantized and targeted for RK3588.
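
The max_context constraint from the steps above (at most 16384 and a multiple of 32) is easy to get wrong when picking a custom value. The following is a minimal sketch of a check you could run before editing export_rkllm.py. The helper names is_valid_max_context and round_up_max_context are hypothetical and are not part of RKLLM; only the two limits come from this guide.

```python
MAX_CONTEXT_LIMIT = 16384  # upper bound stated in this guide
MAX_CONTEXT_STEP = 32      # max_context must be a multiple of this value


def is_valid_max_context(value: int) -> bool:
    """Return True if value satisfies both constraints from this guide."""
    return 0 < value <= MAX_CONTEXT_LIMIT and value % MAX_CONTEXT_STEP == 0


def round_up_max_context(requested: int) -> int:
    """Round a requested context length up to the next multiple of 32.

    Hypothetical helper: raises ValueError when the request is not
    positive or cannot fit under the 16384 limit.
    """
    if requested <= 0:
        raise ValueError("requested context length must be positive")
    # Ceiling division, then scale back up to a multiple of the step.
    rounded = -(-requested // MAX_CONTEXT_STEP) * MAX_CONTEXT_STEP
    if rounded > MAX_CONTEXT_LIMIT:
        raise ValueError(f"{requested} exceeds the {MAX_CONTEXT_LIMIT} limit")
    return rounded


if __name__ == "__main__":
    for candidate in (4096, 5000, 16384):
        print(candidate, is_valid_max_context(candidate), round_up_max_context(candidate))
```

Rounding up rather than down is a design choice in this sketch: it guarantees the exported model can hold at least the context length you asked for, at the cost of slightly more memory.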

Build the executable

  • Download the cross-compilation toolchain: gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu

  • Update the main program: rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/src/llm_demo.cpp

    Comment out line 165. RKLLM parses the chat_template from tokenizer_config.json automatically during conversion, so you don't need to set it manually.

    CPP Code
    165 // rkllm_set_chat_template(llmHandle, "", "<|User|>", "<|Assistant|>");
  • Update GCC_COMPILER_PATH in rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/build-linux.sh

    BASH
    GCC_COMPILER_PATH=/path/to/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  • Build

    X86 Linux PC
    cd rknn-llm/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/
    bash build-linux.sh

    The generated binaries are located at install/demo_Linux_aarch64.
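
A wrong GCC_COMPILER_PATH is a common reason for the build step to fail. Note that the value is a prefix (it ends in aarch64-none-linux-gnu), not a directory. The sketch below checks that prefix before you run build-linux.sh. It assumes the script locates the compilers by appending -gcc and -g++ to the prefix, which you should confirm against your copy of build-linux.sh. The function name missing_cross_compilers is hypothetical.

```python
from pathlib import Path
from typing import List


def missing_cross_compilers(gcc_compiler_path: str) -> List[str]:
    """Return the compiler binaries that do not exist for the given prefix.

    Assumption: build-linux.sh appends "-gcc" and "-g++" to
    GCC_COMPILER_PATH to find the C and C++ cross-compilers.
    """
    candidates = [f"{gcc_compiler_path}-gcc", f"{gcc_compiler_path}-g++"]
    return [path for path in candidates if not Path(path).is_file()]


if __name__ == "__main__":
    # Placeholder path copied from the step above; replace it with your own.
    prefix = "/path/to/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu"
    missing = missing_cross_compilers(prefix)
    if missing:
        print("Missing compiler binaries:")
        for path in missing:
            print("  " + path)
    else:
        print("Toolchain prefix looks usable.")
```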

Deploy to the device

Local terminal mode

  • Copy the converted RKLLM model and the built demo_Linux_aarch64 folder to the device.

  • Export environment variables:

    Radxa OS
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/demo_Linux_aarch64/lib
  • Run llm_demo (type exit to quit):

    Radxa OS
    export RKLLM_LOG_LEVEL=1
    ## Usage: ./llm_demo model_path max_new_tokens max_context_len
    ./llm_demo /path/to/Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm 2048 4096
    tip

    If you see "failed to open rknpu module" or "failed to open rknn device" at this step, the RKNPU2 userspace package is usually missing or the driver version does not meet RKLLM's requirements. Return to the board-side driver configuration section of RKLLM Installation, confirm that rknpu2-rk3588 is installed, and check the driver version in /sys/kernel/debug/rknpu/version.

    Parameter        Required  Description                Notes
    path             Yes       Path to the RKLLM model    N/A
    max_new_tokens   Yes       Max generated tokens/turn  Must be ≤ max_context_len
    max_context_len  Yes       Max context length         Must be ≤ export max_context
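
The two limits on the llm_demo arguments (max_new_tokens ≤ max_context_len ≤ the max_context used at export) can be checked before launching. The sketch below validates them and assembles the same command line and environment variables used in the steps above. The function names are hypothetical, and the default export_max_context of 4096 assumes you did not change max_context in export_rkllm.py.

```python
import os
from typing import Dict, List


def build_llm_demo_args(model_path: str, max_new_tokens: int,
                        max_context_len: int,
                        export_max_context: int = 4096) -> List[str]:
    """Validate the documented limits and return the llm_demo argument list."""
    if max_context_len > export_max_context:
        raise ValueError("max_context_len must be <= the max_context used at export")
    if max_new_tokens > max_context_len:
        raise ValueError("max_new_tokens must be <= max_context_len")
    return ["./llm_demo", model_path, str(max_new_tokens), str(max_context_len)]


def build_llm_demo_env(lib_dir: str, log_level: int = 1) -> Dict[str, str]:
    """Return a copy of the environment with the variables set in this guide."""
    env = dict(os.environ)
    current = env.get("LD_LIBRARY_PATH", "")
    # Append the demo's lib directory, as the export command in this guide does.
    env["LD_LIBRARY_PATH"] = f"{current}:{lib_dir}" if current else lib_dir
    env["RKLLM_LOG_LEVEL"] = str(log_level)
    return env


if __name__ == "__main__":
    args = build_llm_demo_args(
        "/path/to/Qwen2.5-1.5B-Instruct_W8A8_RK3588.rkllm", 2048, 4096)
    env = build_llm_demo_env("/path/to/demo_Linux_aarch64/lib")
    print(" ".join(args))
    # To launch on the board, pass both to subprocess.run, for example:
    # subprocess.run(args, env=env, cwd="/path/to/demo_Linux_aarch64")
```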


Performance comparison (selected models)

Model      Parameter Size  Chip    Chip Count  Inference Speed
TinyLlama  1.1B            RK3588  1           15.03 token/s
Qwen       1.8B            RK3588  1           14.18 token/s
Phi3       3.8B            RK3588  1           6.46 token/s
ChatGLM3   6B              RK3588  1           3.67 token/s
Qwen2.5    1.5B            RK3588  1           15.44 token/s
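
To put these speeds in practical terms, the sketch below estimates how long a reply of a given length would take for each model. It assumes the listed figure is a sustained generation rate and ignores prompt processing time, so treat the result as a rough lower bound. The function name estimate_seconds is hypothetical; the speeds are copied from the table.

```python
# Inference speeds from the table above, in tokens per second on one RK3588.
SPEEDS_TOKENS_PER_SECOND = {
    "TinyLlama 1.1B": 15.03,
    "Qwen 1.8B": 14.18,
    "Phi3 3.8B": 6.46,
    "ChatGLM3 6B": 3.67,
    "Qwen2.5 1.5B": 15.44,
}


def estimate_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimate generation time, assuming a constant generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


if __name__ == "__main__":
    reply_length = 2048  # same value as max_new_tokens in the run example
    for model, speed in SPEEDS_TOKENS_PER_SECOND.items():
        seconds = estimate_seconds(reply_length, speed)
        print(f"{model}: about {seconds:.0f} s for {reply_length} tokens")
```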


    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0