
Qwen3-VL-2B-Instruct

This document demonstrates how to run the Qwen3-VL-2B-Instruct model on the Radxa AX-M1.

Model                 Parameters  Quantization  Hugging Face Repo
Qwen3-VL-2B-Instruct  2B          GPTQ-Int4     AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4

Install axllm

axllm is an LLM inference tool from AXERA that supports interactive command-line use and an OpenAI-compatible API.

Method 1: Clone the repo and run the install script

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Method 2: One-line install (default branch axllm)

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Method 3: Download executable from GitHub Actions CI

If you don't have a build environment, download the latest CI-built executable from the ax-llm Actions page on GitHub, then make it executable and move it onto your PATH:

chmod +x axllm
sudo mv axllm /usr/bin/axllm
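
Whichever method you use, it can help to confirm that the binary is reachable before continuing. The following is a minimal sketch using only the Python standard library; the helper name find_axllm is made up for illustration and is not part of axllm.

```python
import shutil


def find_axllm(binary="axllm"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_axllm()
    if path:
        print(f"axllm found at {path}")
    else:
        print("axllm not found on PATH; re-run the install step or check /usr/bin")
```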

Download Model

Create a virtual environment and install huggingface_hub

python3 -m venv .venv
source .venv/bin/activate
pip install huggingface_hub

Download the model

hf download AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4 --local-dir ./Qwen3-VL-2B
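
The same download can also be scripted from Python. The sketch below assumes the huggingface_hub package installed in the previous step and uses its snapshot_download function; the helper name download_model and the two constants are illustrative.

```python
REPO_ID = "AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4"
LOCAL_DIR = "./Qwen3-VL-2B"


def download_model(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download every file in the repo into local_dir and return that path."""
    # Imported here so this file can be loaded before the package is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling download_model() needs network access and returns the local folder path, which is the directory passed to axllm in the next section.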

Run Model

Command-line interactive mode

axllm run Qwen3-VL-2B/

VLM Usage Instructions:

  • After you enter a prompt, axllm shows a second prompt, image >>
    • Press Enter without typing anything: text-only conversation
    • Enter an image path: image + text conversation
    • Enter video:<frames_dir>: video/multi-frame conversation

Usage Tip

If you make a typo and delete characters while entering the image path, the path may not be recognized correctly. In that case, enter /reset to reset the KV cache, then enter the prompt and the path again.

Qwen3-VL-2B Example

$ axllm run Qwen3-VL-2B/
...
Commands:
/q, /exit exit /reset reset kvcache
/dd delete one turn /pp print history
Ctrl+C: stop current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >>What is this picture about?
image >> ./image.png
18:53:42.571 INF Run:1023 | ttft: 740.26 ms

This image depicts astronauts exploring in a jungle. They are wearing white spacesuits, standing among lush green plants, with a dark background and an overall cool tone, creating a mysterious, sci-fi atmosphere.

18:57:35.909 NTC Run:1145 | hit eos,avg 4.65 token/s

OpenAI-compatible API server mode

axllm serve Qwen3-VL-2B/

After the server starts, you can call it over HTTP, for example with the OpenAI Python client (pip install openai):

from openai import OpenAI

# Local axllm server endpoint and the model name to request
API_URL = "http://127.0.0.1:8000/v1"
MODEL = "Qwen3-VL-2B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": "Hello"},
]

# The key is a placeholder; the OpenAI client requires some value
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
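
Since this is a vision-language model, a request will often carry an image as well. The sketch below builds a user message in the OpenAI chat format with the image embedded as a base64 data URL. Whether axllm serve accepts image_url content parts is an assumption that this document does not confirm, and the helper names build_image_message and ask_about_image are illustrative.

```python
import base64
import mimetypes

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "Qwen3-VL-2B"


def build_image_message(image_path, prompt):
    """Return a user message holding a text part and a base64-encoded image part."""
    mime = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumption: the server understands OpenAI-style image_url parts.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


def ask_about_image(image_path, prompt):
    """Send one image + text turn to the local server and return the reply text."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    client = OpenAI(api_key="not-needed", base_url=API_URL)
    completion = client.chat.completions.create(
        model=MODEL,
        messages=[build_image_message(image_path, prompt)],
    )
    return completion.choices[0].message.content
```

With the server running, ask_about_image("./image.png", "What is this picture about?") would mirror the command-line example, provided the assumption about image_url support holds.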

Performance

Model                 Input Size  Image Count  TTFT       Generation Speed  CMM Memory
Qwen3-VL-2B-Instruct  384×384     1            740.26 ms  4.65 token/s      2384 MB

Info

  • TTFT (Time To First Token): latency until the first token is generated
  • Generation Speed: measured in tokens per second
  • Test Platform: Rock 5B Plus + AX-M1
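
As a rough way to read these numbers, the time for a full reply can be approximated as TTFT plus the number of generated tokens divided by the generation speed. The sketch below applies that approximation to the figures in the table. The function name and the 100-token reply length are illustrative, and real timings will vary with the prompt, image size, and system load.

```python
TTFT_MS = 740.26          # first-token latency from the table above
TOKENS_PER_SECOND = 4.65  # generation speed from the table above


def estimate_reply_seconds(num_tokens, ttft_ms=TTFT_MS, speed=TOKENS_PER_SECOND):
    """Approximate the wall-clock time to generate num_tokens tokens."""
    if num_tokens < 0 or speed <= 0:
        raise ValueError("num_tokens must be >= 0 and speed must be > 0")
    return ttft_ms / 1000.0 + num_tokens / speed


if __name__ == "__main__":
    # 100 tokens is an arbitrary example length, not a measured value.
    print(f"{estimate_reply_seconds(100):.1f} s")
```

Under this approximation, a 100-token reply takes roughly 22 seconds, almost all of it spent in generation rather than in the first-token latency.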

    Radxa-docs © 2026 by Radxa Computer (Shenzhen) Co.,Ltd. is licensed under CC BY 4.0