Vision-language models
This section provides example usage for vision-language models (VLMs).
📄️ InternVL2_5-1B
This document explains how to run the InternVL2_5-1B sample application on a host device equipped with the Radxa AICore AX-M1.
📄️ InternVL3-2B
This document explains how to run the InternVL3-2B sample application on a host device equipped with the Radxa AICore AX-M1. For model conversion, refer to the corresponding model conversion guide.
📄️ YOLO-World-V2
This document explains how to run the YOLO-World-V2 sample application on a host device equipped with the Radxa AICore AX-M1.
📄️ Qwen2.5-VL-3B-Instruct
This document explains how to run the Qwen2.5-VL-3B-Instruct sample application on a host device equipped with the Radxa AICore AX-M1.
📄️ Qwen3.5
Qwen3.5 is a native multimodal large model released by Alibaba Cloud's Tongyi Lab in February 2026. It uses a hybrid architecture (linear attention + MoE) with 397 billion total parameters, of which 17 billion are activated. It supports 201 languages and performs strongly in reasoning, programming, agent capabilities, and multimodal understanding.
📄️ Qwen3-VL-2B-Instruct
This document demonstrates how to run the Qwen3-VL-2B-Instruct model on the Radxa AICore AX-M1.