RKLLM Qwen2-VL

Qwen2-VL is a multi-modal vision-language model (VLM) developed by Alibaba. It can understand images of various resolutions and aspect ratios, comprehend videos longer than 20 minutes, act as an agent for operating mobile devices and robots, and supports multiple languages.
This document explains how to deploy the Qwen2-VL-2B-Instruct multi-modal model on RK3588, using the NPU for hardware-accelerated inference.

(Figure: rkllm_qwen2_vl_1.webp)

Model File Download

tip

Radxa provides precompiled rkllm models and executables that users can download and use directly. If you want to reproduce the compilation process, please continue with the optional sections below.

  • Use git LFS to download the precompiled rkllm from ModelScope:

    X86 Linux PC
    git lfs install
    git clone https://www.modelscope.cn/radxa/Qwen2-VL-2B-RKLLM.git

(Optional) Model Compilation

tip

Please prepare the RKLLM working environment on both your PC and development board according to RKLLM Installation.

tip

For RK358X users, please specify rk3588 as the target platform (TARGET_PLATFORM).

  • On your x86 PC workstation, download the Qwen2-VL-2B-Instruct weight files. If you haven't installed git-lfs, please install it first:

    X86 Linux PC
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
  • Activate the rkllm conda environment. You can refer to RKLLM conda installation for details:

    X86 Linux PC
    conda activate rkllm

Compile Vision Encoder Model

  • Install rknn-toolkit2:

    X86 Linux PC
    pip3 install rknn-toolkit2 -i https://mirrors.aliyun.com/pypi/simple
  • Convert to ONNX

    • Generate cu_seqlens and rotary_pos_emb

      X86 Linux PC
      python3 export/export_vision.py --step 1 --path /path/to/Qwen2-VL-2B-Instruct/ --batch 1 --height 392 --width 392
    • Export as ONNX

      X86 Linux PC
      python3 export/export_vision.py --step 2 --path /path/to/Qwen2-VL-2B-Instruct/ --batch 1 --height 392 --width 392
    | Parameter | Required | Description | Options |
    | --- | --- | --- | --- |
    | step | Required | Export step | 1/2. When step == 1, only generates cu_seqlens and rotary_pos_emb; when step == 2, exports the ONNX model. Step 1 must be run before step 2. |
    | path | Optional | Path to the Hugging Face model folder | Default: Qwen/Qwen2-VL-2B-Instruct |
    | batch | Optional | Batch size | Default: 1 |
    | height | Optional | Image height | Default: 392 |
    | width | Optional | Image width | Default: 392 |
    | savepath | Optional | Save path for the exported ONNX model | Default: qwen2-vl-2b/qwen2_vl_2b_vision.onnx |
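
The deployment step below expects an RKNN vision encoder (qwen2_vl_2b_vision_rk3588.rknn), so the exported ONNX model still needs to be converted with rknn-toolkit2. The rknn-llm repository may ship its own conversion script for this step, which should be preferred; the sketch below is only a minimal illustration of the generic rknn-toolkit2 flow, and the input path, quantization choice, and output name are assumptions rather than the repository's exact settings.

Python Code
# Minimal sketch (assumptions noted above): convert the exported vision encoder
# ONNX model into an RKNN model for the RK3588 NPU using rknn-toolkit2.
from rknn.api import RKNN

rknn = RKNN(verbose=True)
rknn.config(target_platform="rk3588")                        # match your board
rknn.load_onnx(model="qwen2-vl-2b/qwen2_vl_2b_vision.onnx")  # ONNX from the export step above
rknn.build(do_quantization=False)                            # keep the encoder in float (assumption)
rknn.export_rknn("./qwen2_vl_2b_vision_rk3588.rknn")
rknn.release()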

Compile RKLLM Model

  • Generate VLM model quantization calibration file:

    X86 Linux PC
    cd rknn-llm/examples/Qwen2-VL_Demo
    python3 data/make_input_embeds_for_quantize.py --path /path/to/Qwen2-VL-2B-Instruct
    | Parameter | Required | Description | Options |
    | --- | --- | --- | --- |
    | path | Required | Path to the Hugging Face model folder | N/A |

    The generated calibration file is saved in data/input.json.

  • Modify the maximum context value max_context

    If you need to adjust the max_context length, modify the max_context parameter of the llm.build call in rknn-llm/examples/Qwen2-VL_Demo/export/export_rkllm.py. Larger values consume more memory. It must not exceed 16,384 and must be a multiple of 32 (e.g., 32, 64, 96, ..., 16,384). A simplified sketch of the build call is shown after these steps.

  • Run the model conversion script:

    X86 Linux PC
    python3 export/export_rkllm.py --path /path/to/Qwen2-VL-2B-Instruct/ --target-platform rk3588 --num_npu_core 3 --quantized_dtype w8a8 --device cuda --savepath ./qwen2-vl-llm_rk3588.rkllm
    | Parameter | Required | Description | Options |
    | --- | --- | --- | --- |
    | path | Optional | Path to the Hugging Face model folder | Default: Qwen/Qwen2-VL-2B-Instruct |
    | target-platform | Optional | Target running platform | rk3588 / rk3576 / rk3562. Default: rk3588 |
    | num_npu_core | Optional | Number of NPU cores | rk3588: [1,2,3]; rk3576: [1,2]; rk3562: [1]. Default: 3 |
    | quantized_dtype | Optional | RKLLM quantization type | rk3588: w8a8, w8a8_g128, w8a8_g256, w8a8_g512; rk3576: w4a16, w4a16_g32, w4a16_g64, w4a16_g128, w8a8; rk3562: w8a8, w4a16_g32, w4a16_g64, w4a16_g128, w4a8_g32. Default: w8a8 |
    | device | Optional | Device used during model conversion | cpu or cuda. Default: cpu |
    | savepath | Optional | Save path for the RKLLM model | Default: qwen2_vl_2b_instruct.rkllm |

    The generated RKLLM model is named qwen2-vl-llm_rk3588.rkllm.
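
For reference, the conversion script essentially drives the rkllm-toolkit Python API: load the Hugging Face weights, build a quantized model against the calibration data, then export the .rkllm file. The sketch below is a simplified illustration of that flow with max_context set explicitly; argument names and defaults may differ between rkllm-toolkit versions, so treat export_rkllm.py itself as authoritative.

Python Code
# Simplified, illustrative sketch of the export_rkllm.py flow (not the exact script;
# argument names may differ between rkllm-toolkit versions).
from rkllm.api import RKLLM

llm = RKLLM()
# Load the Hugging Face checkpoint (corresponds to the --path option).
llm.load_huggingface(model="/path/to/Qwen2-VL-2B-Instruct")
# Quantize and build for the target NPU; max_context is the value discussed above.
llm.build(
    do_quantization=True,
    quantized_dtype="w8a8",
    target_platform="rk3588",
    num_npu_core=3,
    max_context=4096,           # assumed example value: multiple of 32, at most 16384
    dataset="data/input.json",  # calibration file generated above
)
# Write the deployable model file.
llm.export_rkllm("./qwen2-vl-llm_rk3588.rkllm")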

(Optional) Build Executable

  • Download the cross-compilation toolchain gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu

  • Modify the main program rknn-llm/examples/Qwen2-VL_Demo/deploy/src/main.cpp

    You need to comment out line 179. During model conversion, RKLLM automatically parses the chat_template field in the Hugging Face model's tokenizer_config.json, so there is no need to set the chat template manually in code.

    CPP Code
    179 // rkllm_set_chat_template(llmHandle, "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n", "<|im_start|>user\n", "<|im_end|>\n<|im_start|>assistant\n");
  • Modify the main program rknn-llm/examples/Qwen2-VL_Demo/deploy/src/llm.cpp

    You need to comment out line 120. During model conversion, RKLLM automatically parses the chat_template field in the Hugging Face model's tokenizer_config.json, so there is no need to set the chat template manually in code.

    CPP Code
    120 // rkllm_set_chat_template(llmHandle, "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n", "<|im_start|>user\n", "<|im_end|>\n<|im_start|>assistant\n");
  • Modify GCC_COMPILER_PATH on line 5 of the compilation script rknn-llm/examples/Qwen2-VL_Demo/deploy/build-linux.sh

    BASH
    5 GCC_COMPILER_PATH=/path/to/gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu
  • Run the build script

    X86 Linux PC
    cd rknn-llm/examples/Qwen2-VL_Demo/deploy
    bash build-linux.sh

    The generated executable file is located in install/demo_Linux_aarch64

Deploying on Device

Terminal Mode

  • Copy the converted models qwen2_vl_2b_vision_rk3588.rknn and qwen2-vl-llm_rk3588.rkllm, together with the compiled demo_Linux_aarch64 folder, to the device

  • Set environment variables

    Radxa OS
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/demo_Linux_aarch64/lib
    tip

    Users who downloaded the precompiled package from ModelScope can point LD_LIBRARY_PATH at the librkllmrt.so bundled in the downloaded repository.

  • Run the demo executable; type exit to quit

    Radxa OS
    export RKLLM_LOG_LEVEL=1
    ## Usage: ./demo image_path encoder_model_path llm_model_path max_new_tokens max_context_len core_num
    ./demo demo.jpg ./qwen2_vl_2b_vision_rk3588.rknn ./qwen2-vl-llm_rk3588.rkllm 128 512 3
    | Parameter | Required | Description | Options |
    | --- | --- | --- | --- |
    | image_path | Required | Path to the input image | N/A |
    | encoder_model_path | Required | Path to the RKNN vision encoder model | N/A |
    | llm_model_path | Required | Path to the RKLLM model | N/A |
    | max_new_tokens | Required | Maximum number of tokens to generate per round | Must be ≤ max_context_len |
    | max_context_len | Required | Maximum context length of the model | Must be > text token count + image token count + max_new_tokens |
    | core_num | Required | Number of NPU cores to use | rk3588: [1,2,3]; rk3576: [1,2]; rk3562: [1] |

    (Figure: rkllm_2.webp)
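
As a worked check of the max_context_len constraint, take the numbers from the example command above together with the prefill token count reported in the performance table below; this is only a sanity check, not something the demo requires you to run.

Python Code
# Constraint: max_context_len > image tokens + text tokens + max_new_tokens.
# 222 is the prefill (image + text) token count from the performance table below.
prefill_tokens = 222
max_new_tokens = 128
max_context_len = 512
print(prefill_tokens + max_new_tokens < max_context_len)  # True: 350 < 512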

Performance Analysis

On RK3588, Qwen2-VL-2B-Instruct reaches up to 15.39 tokens/s during generation:

| Stage | Total Time (ms) | Tokens | Time per Token (ms) | Tokens per Second |
| --- | --- | --- | --- | --- |
| Prefill | 929.40 | 222 | 4.19 | 238.86 |
| Generate | 3897.42 | 60 | 64.96 | 15.39 |
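
The throughput figures follow directly from each stage's token count and total time:

Python Code
# Throughput = tokens / total time; the results match the table above.
prefill_ms, prefill_tokens = 929.40, 222
generate_ms, generate_tokens = 3897.42, 60
print(round(prefill_tokens / prefill_ms * 1000, 2))    # 238.86 tokens/s (prefill)
print(round(generate_tokens / generate_ms * 1000, 2))  # 15.39 tokens/s (generate)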