
RKLLM SmolVLM2

SmolVLM2 is a family of compact vision-language models developed by Hugging Face for resource-constrained devices such as smartphones and embedded systems. Their small footprint makes them practical on hardware that cannot host larger models. This document describes how to use RKLLM to deploy SmolVLM2 256M / 500M / 2.2B on RK3588 and run NPU-accelerated inference.

Original Contribution

This model was provided by the Radxa community user @Rients Politiek.

Original Radxa community forum post: SmolVLM2 for RK3588 NPU

Model Deployment

SmolVLM2 is available in three model sizes: 256M, 500M, and 2.2B. Set the parameters below for the size you want to deploy. The example selects the 256M model.

Parameter Selection

Device
export MODEL_SIZE=256m REPO_SIZE=256M
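
The two variables differ only in letter case in the 256M example, so one can be derived from the other. The sketch below does that with a hypothetical helper, `repo_size_for`. It assumes the same upper-case rule holds for the other two sizes, which the example above does not confirm.

```shell
# Hypothetical helper: derive the repository suffix from the model size.
# Assumption: the suffix is simply the upper-case form (256m -> 256M).
repo_size_for() {
    printf '%s' "$1" | tr '[:lower:]' '[:upper:]'
}

MODEL_SIZE="${MODEL_SIZE:-256m}"
REPO_SIZE="$(repo_size_for "$MODEL_SIZE")"
export MODEL_SIZE REPO_SIZE
echo "MODEL_SIZE=${MODEL_SIZE} REPO_SIZE=${REPO_SIZE}"
```

Check the printed values against the actual repository names before cloning.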

Download the Code

Device
git clone https://github.com/Qengineering/SmolVLM2-${REPO_SIZE}-NPU.git && cd SmolVLM2-${REPO_SIZE}-NPU

Build the Project

Install Dependencies

Device
sudo apt update
sudo apt install cmake gcc g++ make libopencv-dev

Build with CMake

Device
cmake -B build -DRK_LIB_PATH=${PWD}/aarch64/library -DCMAKE_CXX_FLAGS="-I${PWD}/aarch64/include"
cmake --build build -j4
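
After the build it can help to confirm that the executable exists before downloading the model. The run example later on this page starts `./VLM_NPU` from the repository root, so the sketch below looks there first. The `build/` fallback and the helper name `find_vlm_binary` are assumptions, not part of the project.

```shell
# Hypothetical helper: print the path of the first executable VLM_NPU found.
# Assumption: the binary lands in the project root or in build/.
find_vlm_binary() {
    for candidate in "$1/VLM_NPU" "$1/build/VLM_NPU"; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

find_vlm_binary "$PWD" || echo "VLM_NPU not found; check the CMake output for errors" >&2
```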

Download the Model

Install hf-cli

Device
curl -LsSf https://hf.co/cli/install.sh | bash

Download the Model

Device
hf download Qengineering/SmolVLM2-${MODEL_SIZE}-rk3588 --local-dir ./SmolVLM2-${MODEL_SIZE}-rk3588
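
A quick check that both model files arrived can save a failed first run. The sketch below uses a hypothetical helper, `missing_model_files`, and takes the file names from the run command on this page. Those names are confirmed only for the 256M model, so treating them as a pattern for the other sizes is an assumption.

```shell
# Hypothetical helper: print the name of each expected model file that is
# missing from the download directory.
# Assumption: file names follow the pattern used in the 256M run command.
missing_model_files() {
    dir="$1"
    size="$2"
    for f in "smolvlm2_${size}_vision_fp16_rk3588.rknn" \
             "smolvlm2-${size}-instruct_w8a8_rk3588.rkllm"; do
        [ -f "${dir}/${f}" ] || printf '%s\n' "$f"
    done
}

MODEL_SIZE="${MODEL_SIZE:-256m}"
missing="$(missing_model_files "./SmolVLM2-${MODEL_SIZE}-rk3588" "$MODEL_SIZE")"
if [ -n "$missing" ]; then
    echo "Missing model files:"
    echo "$missing"
else
    echo "All model files present"
fi
```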

Run the Examples

Device
export RKLLM_LOG_LEVEL=1
# VLM_NPU Picture RKNN_model RKLLM_model NewTokens ContextLength
./VLM_NPU ./Moon.jpg ./SmolVLM2-${MODEL_SIZE}-rk3588/smolvlm2_${MODEL_SIZE}_vision_fp16_rk3588.rknn ./SmolVLM2-${MODEL_SIZE}-rk3588/smolvlm2-${MODEL_SIZE}-instruct_w8a8_rk3588.rkllm 2048 4096
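
The five positional arguments are easier to adjust when each one has a name. The sketch below rebuilds the same command from variables and only prints it. The variable names are illustrative, and the comments on `NEW_TOKENS` and `CONTEXT_LENGTH` are an interpretation of the usage comment, not documented behavior.

```shell
MODEL_SIZE="${MODEL_SIZE:-256m}"
MODEL_DIR="./SmolVLM2-${MODEL_SIZE}-rk3588"

PICTURE="./Moon.jpg"
RKNN_MODEL="${MODEL_DIR}/smolvlm2_${MODEL_SIZE}_vision_fp16_rk3588.rknn"      # vision encoder
RKLLM_MODEL="${MODEL_DIR}/smolvlm2-${MODEL_SIZE}-instruct_w8a8_rk3588.rkllm"  # language model
NEW_TOKENS=2048       # "NewTokens": presumably the cap on generated tokens
CONTEXT_LENGTH=4096   # "ContextLength": presumably the context window size

# Print the command instead of running it, so it can be reviewed first.
CMD="./VLM_NPU ${PICTURE} ${RKNN_MODEL} ${RKLLM_MODEL} ${NEW_TOKENS} ${CONTEXT_LENGTH}"
echo "$CMD"
```

Copy the printed line into the terminal on the device to start the run.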

input image

prompt: <image>Describe the image.
rock@rock-5b-plus:~/SmolVLM2-256M-NPU$ ./VLM_NPU ./Moon.jpg ./SmolVLM2-${MODEL_SIZE}-rk3588/smolvlm2_${MODEL_SIZE}_vision_fp16_rk3588.rknn ./SmolVLM2-${MODEL_SIZE}-rk3588/smolvlm2-${MODEL_SIZE}-instruct_w8a8_rk3588.rkllm 2048 4096
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from ./SmolVLM2-256m-rk3588/smolvlm2-256m-instruct_w8a8_rk3588.rkllm
I rkllm: rkllm-toolkit version: 1.2.2, max_context_limit: 4096, npu_core_num: 3, target_platform: RK3588, model_dtype: W8A8
I rkllm: Enabled cpus: [4, 5, 6, 7]
I rkllm: Enabled cpus num: 4
rkllm init success
I rkllm: reset chat template:
I rkllm: system_prompt: <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n
I rkllm: prompt_prefix: <|im_start|>user\n
I rkllm: prompt_postfix: <|im_end|>\n<|im_start|>assistant\n
W rkllm: Calling rkllm_set_chat_template will disable the internal automatic chat template parsing, including enable_thinking. Make sure your custom prompt is complete and valid.

used NPU cores 3

model input num: 1, output num: 1

Input tensors:
index=0, name=pixel_values, n_dims=4, dims=[1, 384, 384, 3], n_elems=442368, size=884736, fmt=NHWC, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000

Output tensors:
index=0, name=output, n_dims=3, dims=[1, 36, 576, 0], n_elems=20736, size=41472, fmt=UNDEFINED, type=FP16, qnt_type=AFFINE, zp=0, scale=1.000000

Model input height=384, width=384, channel=3


User: <image>Describe the image.
Answer: The image depicts a scene from space, specifically looking at the moon's surface. The moon is in the process of being tidied up and has been cleaned to remove any debris or stains that might have accumulated over time. The overall atmosphere appears to be clear and bright, with no visible signs of pollution or other human activity.

The image also includes a large number of small objects scattered across the surface of the moon, which appear to be rocks or boulders. These objects are scattered randomly around the moon's surface, creating a sense of randomness and disorder. The overall atmosphere is calm and serene, with no signs of any movement or activity in the scene.

Overall, this image gives a sense of the beauty and cleanliness of the lunar environment, as well as the ongoing process of tidying up the moon's surface.
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Model init time (ms) 227.84
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Stage Total Time (ms) Tokens Time per Token (ms) Tokens per Second
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Prefill 97.59 78 1.25 799.24
I rkllm: Generate 2643.09 166 15.92 62.81
I rkllm: --------------------------------------------------------------------------------------
I rkllm: Peak Memory Usage (GB)
I rkllm: 0.59
I rkllm: --------------------------------------------------------------------------------------

Performance Analysis

On the ROCK 5B+, the 256M model generates 62.81 tokens/s in the run shown above.

Stage      Total Time (ms)   Tokens   Time per Token (ms)   Tokens per Second
Prefill    97.59             78       1.25                  799.24
Generate   2643.09           166      15.92                 62.81
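
To compare runs, the throughput figures can be pulled straight from the runtime log. The sketch below is a minimal `awk` filter wrapped in a hypothetical function, `perf_summary`. It assumes the log keeps the `I rkllm: <Stage> <total ms> <tokens> <ms/token> <tokens/s>` layout printed when `RKLLM_LOG_LEVEL=1` is set.

```shell
# Hypothetical helper: read rkllm log text on stdin and print the reported
# tokens-per-second value for the Prefill and Generate stages.
perf_summary() {
    awk '$1 == "I" && $2 == "rkllm:" && ($3 == "Prefill" || $3 == "Generate") {
        # Fields: $3 stage, $4 total ms, $5 tokens, $6 ms per token, $7 tokens/s
        printf "%s %s tokens/s\n", $3, $7
    }'
}

# Example using the two log lines from the run above.
perf_summary <<'EOF'
I rkllm: Stage Total Time (ms) Tokens Time per Token (ms) Tokens per Second
I rkllm: Prefill 97.59 78 1.25 799.24
I rkllm: Generate 2643.09 166 15.92 62.81
EOF
```

If the program output was saved to a file, the same function can read it with `perf_summary < file`.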

Memory Usage

Model                    256M   500M   2.2B
Peak Memory Usage (GB)   0.59   0.88   3.39
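
The peak figures above can be turned into a rough pre-flight check against the memory currently available on the board. The sketch below uses two hypothetical helpers, `peak_memory_gb` and `fits_in_memory`. The size keys `500m` and `2.2b` are assumed values of `MODEL_SIZE`, and `MemAvailable` is converted from KiB to GiB, so the comparison with the GB figures in the table is approximate.

```shell
# Hypothetical helper: peak memory (GB) per model size, taken from the table.
peak_memory_gb() {
    case "$1" in
        256m) echo 0.59 ;;
        500m) echo 0.88 ;;
        2.2b) echo 3.39 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed if the available memory (GB, second argument)
# covers the peak usage of the given model size (first argument).
fits_in_memory() {
    need="$(peak_memory_gb "$1")" || return 2
    awk -v need="$need" -v avail="$2" 'BEGIN { exit !(avail + 0 >= need + 0) }'
}

if [ -r /proc/meminfo ]; then
    avail_gb="$(awk '/^MemAvailable:/ { printf "%.2f", $2 / 1048576 }' /proc/meminfo)"
    if fits_in_memory "${MODEL_SIZE:-256m}" "$avail_gb"; then
        echo "Enough memory available (${avail_gb} GB)"
    else
        echo "Memory may be insufficient (${avail_gb} GB available)"
    fi
fi
```

The check ignores memory used by the vision encoder's input buffers and by other processes that start later, so leave some headroom.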


    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0