Vision-language models

This section provides sample usage for vision-language models (VLMs).

📄️ InternVL2_5-1B

This document explains how to run the InternVL2_5-1B sample application on a host device equipped with the Radxa AICore AX-M1.

📄️ InternVL3-2B

This document explains how to run the InternVL3-2B sample application on a host device equipped with the Radxa AICore AX-M1. For model conversion, please refer to here.

📄️ YOLO-World-V2

This document explains how to run the YOLO-World-V2 sample application on a host device equipped with the Radxa AICore AX-M1.

📄️ Qwen2.5-VL-3B-Instruct

This document explains how to run the Qwen2.5-VL-3B-Instruct sample application on a host device equipped with the Radxa AICore AX-M1.

Qwen3.5 is a native multimodal large model released by Alibaba Cloud's Tongyi Lab in February 2026. It uses a hybrid architecture (linear attention + MoE) with 397 billion total parameters and 17 billion activated parameters, supports 201 languages, and performs strongly in reasoning, programming, agent capabilities, and multimodal understanding.

📄️ Qwen3-VL-2B-Instruct

This document demonstrates how to run the Qwen3-VL-2B-Instruct model on the Radxa AX-M1.

📄️ LocateAnything-3B

Deploy LocateAnything-3B on an 8 GB AX-M1 and locate image targets through the CLI, an OpenAI-compatible API, and a WebUI.