RKLLM Qwen2-VL

Qwen2-VL is a multimodal vision-language model (VLM) developed by Alibaba. It provides strong visual perception, adapts to images of different resolutions and aspect ratios, and supports deeper understanding of long videos (20+ minutes). Qwen2-VL also supports multiple languages and can act as an “agent” for tasks such as phone control and robot instruction execution. This document explains how to deploy Qwen2-VL-2B-Instruct on RK3588 using the RKLLM toolchain and run hardware-accelerated inference on the built-in NPU.

Quick Start

Download the demo

Download the complete demo from ModelScope.

For virtual environment setup, refer to Virtual Environment Usage.

Device
python3 -m venv .venv && source .venv/bin/activate
pip install -U modelscope
modelscope download --model radxa/Qwen2-VL-2B-RKLLM --local_dir ./Qwen2-VL-2B-RKLLM

Run the Example

Device
cd Qwen2-VL-2B-RKLLM/demo_Linux_aarch64/
export LD_LIBRARY_PATH=./lib
chmod +x ./demo
./demo demo.jpg ../qwen2_vl_2b_vision_rk3588.rknn ../qwen2-vl-2b-instruct_W8A8_rk3588.rkllm 2048 4096 3 "<|vision_start|>" "<|vision_end|>" "<|image_pad|>"

Full Conversion Workflow

Prerequisites

Set up the development environment by following RKNN Installation and RKLLM Installation.

The RKLLM toolchain currently converts only the language model, so deploying a multimodal model also requires converting the vision encoder separately with the RKNN toolchain.

Activate the virtual environment

For virtual environment setup, refer to Create Virtual Environment.

X64 Linux PC
conda activate rkllm
pip install -U huggingface_hub

Download the Model

X64 Linux PC
cd RK-SDK/rknn-llm/examples/multimodal_model_demo/
hf download Qwen/Qwen2-VL-2B-Instruct --local-dir ./Qwen2-VL-2B-Instruct

Model Conversion

Generate static positional encodings.

X64 Linux PC
python export/export_vision_qwen2.py --step 1 --path ./Qwen2-VL-2B-Instruct
| Parameter | Required | Description | Notes |
|---|---|---|---|
| step | Yes | Export step | 1/2. When step==1, only generates cu_seqlens and rotary_pos_emb. When step==2, exports ONNX (run step==1 first). |
| path | No | Hugging Face model directory | Default: Qwen/Qwen2-VL-2B-Instruct |
| batch | No | Batch size | Default: 1 |
| height | No | Image height | Default: 392 |
| width | No | Image width | Default: 392 |
| savepath | No | Output path for ONNX/RKNN | Default: qwen2-vl-2b/qwen2_vl_2b_vision.onnx |
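
The default 392 x 392 input size determines how many image tokens the vision encoder hands to the language model. The sketch below estimates that count. It assumes a patch size of 14 and a spatial merge size of 2, which are the values I understand the Qwen2-VL vision config to use; confirm them in the `config.json` of the downloaded model directory. The function name is illustrative and is not part of `export_vision_qwen2.py`.

```python
def image_token_count(height: int, width: int,
                      patch_size: int = 14, merge_size: int = 2) -> int:
    """Estimate the number of visual tokens produced for one image.

    patch_size and merge_size are assumed Qwen2-VL vision config values;
    they are not arguments of the export script.
    """
    unit = patch_size * merge_size  # each token covers a unit x unit pixel area
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    return (height // unit) * (width // unit)


if __name__ == "__main__":
    # Default export resolution from the parameter list above.
    print(image_token_count(392, 392))  # prints 196
```

Under these assumptions the default resolution yields 196 tokens, which agrees with the first output dimension (196) that the demo log reports for the vision encoder.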

Export the vision module to ONNX.

X64 Linux PC
pip install onnx==1.18
python export/export_vision_qwen2.py --step 2 --path ./Qwen2-VL-2B-Instruct

Convert the vision module to RKNN. For RKNN virtual environment setup, refer to Create Virtual Environment.

X64 Linux PC
conda activate rknn
python export/export_vision_rknn.py --path /path/to/save/qwen2-vl-vision.onnx --target-platform rk3588

Generate a quantization calibration file.

X64 Linux PC
conda activate rkllm
python data/make_input_embeds_for_quantize.py --path /path/to/Qwen2-VL-model
| Parameter | Required | Description | Notes |
|---|---|---|---|
| path | Yes | Hugging Face model directory | N/A |

Export the language module to the RKLLM format.

X64 Linux PC
python export/export_rkllm.py
| Parameter | Required | Description | Notes |
|---|---|---|---|
| path | No | Hugging Face model directory | Default: Qwen/Qwen2-VL-2B-Instruct |
| target-platform | No | Target platform | rk3588 / rk3576 / rk3562 (default: rk3588) |
| num_npu_core | No | NPU core count | rk3588: [1,2,3], rk3576: [1,2], rk3562: [1] (default: 3) |
| quantized_dtype | No | RKLLM quantization dtype | Defaults to w8a8 (supported options depend on the platform) |
| device | No | Device used during conversion | cpu / cuda (default: cpu) |
| savepath | No | Output RKLLM model path | Default: qwen2_vl_2b_instruct.rkllm |
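
Because the allowed `num_npu_core` values depend on the target platform, it can help to check the combination before starting a long conversion. The following is a minimal sketch of such a check. The lookup table is copied from the parameter list above; the helper name is hypothetical and is not part of `export_rkllm.py`.

```python
# Allowed NPU core counts per platform, as listed in the parameter table.
SUPPORTED_NPU_CORES = {
    "rk3588": (1, 2, 3),
    "rk3576": (1, 2),
    "rk3562": (1,),
}


def is_valid_core_count(target_platform: str, num_npu_core: int) -> bool:
    """Return True if num_npu_core is listed for the given platform."""
    allowed = SUPPORTED_NPU_CORES.get(target_platform.lower())
    if allowed is None:
        raise ValueError(f"unknown target platform: {target_platform}")
    return num_npu_core in allowed


if __name__ == "__main__":
    print(is_valid_core_count("rk3588", 3))  # the defaults
    print(is_valid_core_count("rk3576", 3))  # rk3576 lists only 1 or 2 cores
```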

Build the executable

For cross-compiler setup, refer to Compiler Tools.

X64 Linux PC
cd deploy/
export GCC_COMPILER=/path/to/your/gcc/bin/aarch64-linux-gnu
bash build-linux.sh

The generated binaries are located at install/demo_Linux_aarch64.

Deploy to the device

Copy the converted models and the built demo_Linux_aarch64 directory to the device.

Device
cd demo_Linux_aarch64/
export RKLLM_LOG_LEVEL=1
export LD_LIBRARY_PATH=./lib

Run the demo. Type exit to quit.

Device
./demo demo.jpg ../qwen2_vl_2b_vision_rk3588.rknn ../qwen2-vl-2b-instruct_W8A8_rk3588.rkllm 2048 4096 3 "<|vision_start|>" "<|vision_end|>" "<|image_pad|>"
| Parameter | Required | Description | Notes |
|---|---|---|---|
| image_path | Yes | Image path | N/A |
| encoder_model_path | Yes | Vision encoder RKNN | N/A |
| llm_model_path | Yes | Language model RKLLM | N/A |
| max_new_tokens | Yes | Max generated tokens | Must be ≤ max_context_len |
| max_context_len | Yes | Max context length | Must be > text_token_num + image_token_num + max_new_tokens |
| core_num | Yes | NPU core count | rk3588: [1,2,3], rk3576: [1,2], rk3562: [1] |
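
The two length constraints above can be checked with simple arithmetic before launching the demo. The sketch below is one way to do that, not part of the demo itself. The function name is hypothetical, and the text token count of 26 in the example is an illustrative guess; the image token count of 196 matches the first output dimension of the vision encoder in the demo log.

```python
def context_headroom(text_tokens: int, image_tokens: int,
                     max_new_tokens: int, max_context_len: int) -> int:
    """Return the unused context slots, or raise if a constraint is violated."""
    if max_new_tokens > max_context_len:
        raise ValueError("max_new_tokens must be <= max_context_len")
    needed = text_tokens + image_tokens + max_new_tokens
    if max_context_len <= needed:
        raise ValueError(
            f"max_context_len ({max_context_len}) must be > {needed} "
            "(text + image + max_new_tokens)"
        )
    return max_context_len - needed


if __name__ == "__main__":
    # Values from the demo command: max_new_tokens=2048, max_context_len=4096.
    # text_tokens=26 is an assumed prompt length for illustration only.
    print(context_headroom(26, 196, 2048, 4096))  # prints 1826
```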
$ ./demo demo.jpg ../qwen2_vl_2b_vision_rk3588.rknn ../qwen2-vl-2b-instruct_W8A8_rk3588.rkllm 2048 4096 3 "<|vision_start|>" "<|vision_end|>" "<|image_pad|>"
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ../qwen2-vl-2b-instruct_W8A8_rk3588.rkllm
I rkllm: rkllm-toolkit version: 1.2.3, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: W8A8
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
I rkllm: Using mrope
rkllm init success
main: LLM Model loaded in 3052.79 ms
===the core num is 3===
model input num: 1, output num: 1
input tensors:
index=0, name=onnx::Expand_0, n_dims=4, dims=[1, 392, 392, 3], n_elems=460992, size=921984, fmt=NHWC, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
output tensors:
index=0, name=4542, n_dims=2, dims=[196, 1536, 0, 0], n_elems=301056, size=602112, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000
model input height=392, width=392, channel=3
main: ImgEnc Model loaded in 2362.74 ms
main: ImgEnc Model inference took 3762.45 ms

**********************You can choose a preset question or type your own********************

[0] <image>What is in the image?

*************************************************************************

user: 0
<image>What is in the image?
assistant: The image depicts an astronaut sitting on a chair holding a green bottle, looking at Earth from the Moon with a starry sky in the background.
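
The tensor lines in the log can be sanity-checked by recomputing `n_elems` and `size` from `dims`, `n_dims`, and the data type. The sketch below does this for the two tensors shown above. The helper name and the bytes-per-element table are illustrative assumptions and are not part of the RKNN API.

```python
from math import prod

# Bytes per element for a few common tensor types (assumed mapping).
BYTES_PER_ELEMENT = {"FP16": 2, "FP32": 4, "INT8": 1}


def tensor_footprint(dims, n_dims, dtype):
    """Return (n_elems, size_in_bytes) using only the first n_dims entries.

    The log prints four dims even for a 2-D tensor and pads the unused
    entries with 0, so trailing entries beyond n_dims are ignored.
    """
    n_elems = prod(dims[:n_dims])
    return n_elems, n_elems * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    print(tensor_footprint([1, 392, 392, 3], 4, "FP16"))   # (460992, 921984)
    print(tensor_footprint([196, 1536, 0, 0], 2, "FP16"))  # (301056, 602112)
```

Both results match the `n_elems` and `size` fields in the log, which confirms that the encoder takes one 392 x 392 RGB image and emits 196 embeddings of width 1536.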

Test image:

demo.jpg

Performance:

| Stage | Total Time (ms) | Tokens | Time per Token (ms) | Tokens per Second |
|---|---|---|---|---|
| Prefill | 929.40 | 222 | 4.19 | 238.86 |
| Generate | 3897.42 | 60 | 64.96 | 15.39 |
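
The last two columns follow directly from the total time and token count of each stage. A minimal sketch of that calculation, using a hypothetical helper name, is shown here for reference when comparing your own measurements:

```python
def stage_stats(total_ms: float, tokens: int) -> tuple[float, float]:
    """Return (time per token in ms, tokens per second), rounded to 2 places."""
    ms_per_token = total_ms / tokens
    tokens_per_second = tokens / (total_ms / 1000.0)
    return round(ms_per_token, 2), round(tokens_per_second, 2)


if __name__ == "__main__":
    print("Prefill ", stage_stats(929.40, 222))   # (4.19, 238.86)
    print("Generate", stage_stats(3897.42, 60))   # (64.96, 15.39)
```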

    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0