Qwen2.5 VL

Qwen2.5-VL is a multimodal vision-language model (VLM) family developed by the Qwen team (Alibaba Cloud). Building on the strengths of the previous generation, it improves long-video understanding, ultra-long document parsing, and logical reasoning in complex scenes, aiming to provide more general and practical visual interaction capabilities.

Key features: strong visual perception and alignment, with support for high-resolution images and for video inputs longer than one hour. A major highlight is the improved “Visual Agent” capability, which enables accurate coordinate grounding, UI interaction, and complex structured-data extraction for automation workflows, multimodal search, and high-accuracy visual Q&A.
Model variant: Qwen2.5-VL-3B-Instruct is a compact (~3B parameters) instruction-tuned model. It balances capability against compute cost, which makes it suitable for edge devices, real-time interactive applications, and low-resource development environments.

Environment Setup

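Follow the llama.cpp guide to build llama.cpp before continuing. The commands on this page are run from the llama.cpp source directory and expect the built binaries under ./build/bin.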

Quick Start

Download the Model

O6 / O6N

pip3 install modelscope
cd llama.cpp
modelscope download --model radxa/Qwen2.5-VL-3B-Instruct-NOE mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf --local_dir ./
modelscope download --model radxa/Qwen2.5-VL-3B-Instruct-NOE Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --local_dir ./
modelscope download --model radxa/Qwen2.5-VL-3B-Instruct-NOE test.png --local_dir ./
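
Before running anything, it can help to confirm that both model files arrived intact. The following is a minimal sketch that relies only on the fact that a GGUF file begins with the four ASCII bytes GGUF. The helper name is_gguf is illustrative and is not part of llama.cpp or modelscope.

```shell
# Hypothetical helper: succeed if FILE exists and starts with the GGUF magic bytes.
is_gguf() {
    [ -f "$1" ] && head -c 4 "$1" | LC_ALL=C grep -q '^GGUF'
}

# Check the two model files fetched by the download step above
# (paths are relative to the llama.cpp directory).
for f in ./Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf ./mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf; do
    if is_gguf "$f"; then
        echo "ok: $f"
    else
        echo "missing or not a GGUF file: $f"
    fi
done
```

A file that fails this check is usually a truncated or interrupted download, and fetching it again is the simplest fix.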

Run the Model

O6 / O6N

./build/bin/llama-mtmd-cli -m ./Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --mmproj ./mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf -p "Describe this image." --image ./test.png
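
For repeated runs, the command above can be wrapped in a small shell function that checks its inputs first. This is a sketch under stated assumptions: describe_image and the variables MTMD_CLI, MODEL, and MMPROJ are names invented for this example, not llama.cpp settings. The llama-mtmd-cli flags are the ones already used in this guide.

```shell
# Illustrative defaults; override them in the environment if your paths differ.
MTMD_CLI="${MTMD_CLI:-./build/bin/llama-mtmd-cli}"
MODEL="${MODEL:-./Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf}"
MMPROJ="${MMPROJ:-./mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf}"

# Hypothetical wrapper: describe_image IMAGE PROMPT [extra llama-mtmd-cli args]
describe_image() {
    if [ "$#" -lt 2 ]; then
        echo "usage: describe_image IMAGE PROMPT [extra llama-mtmd-cli args]" >&2
        return 2
    fi
    image="$1"
    prompt="$2"
    shift 2
    # Fail early with a readable message instead of a loader error.
    for f in "$MODEL" "$MMPROJ" "$image"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done
    # Any remaining arguments are passed through unchanged.
    "$MTMD_CLI" -m "$MODEL" --mmproj "$MMPROJ" -p "$prompt" --image "$image" "$@"
}
```

For example, describe_image ./test.png "Describe this image." reproduces the command above. Extra flags can be appended, such as --image-min-tokens 1024, which the llama.cpp loader log suggests trying when grounding accuracy is poor.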

Full Conversion Workflow

Clone the Model Repository

O6 / O6N

cd llama.cpp
git clone https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
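
Hugging Face model repositories store weight files with Git LFS. If git-lfs is not installed when cloning, the .safetensors files are small text pointers rather than real weights, and the conversion step is likely to fail. The sketch below flags such files; the helper name is_lfs_pointer is hypothetical, and the check relies only on the standard first line of a Git LFS pointer file.

```shell
# Hypothetical helper: succeed if FILE looks like a Git LFS pointer, not real data.
is_lfs_pointer() {
    [ -f "$1" ] && head -c 64 "$1" | LC_ALL=C grep -q '^version https://git-lfs'
}

# Scan the cloned repository for weight files that were not actually fetched.
for f in ./Qwen2.5-VL-3B-Instruct/*.safetensors; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if is_lfs_pointer "$f"; then
        echo "LFS pointer, weights not fetched: $f"
    fi
done
```

If any file is reported, installing git-lfs and running git lfs pull inside the cloned directory should fetch the real weights.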

Create a Virtual Environment

O6 / O6N

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

Model Conversion

Convert the Text Module

O6 / O6N

python3 ./convert_hf_to_gguf.py ./Qwen2.5-VL-3B-Instruct

Convert the Vision Module

O6 / O6N

python3 ./convert_hf_to_gguf.py --mmproj ./Qwen2.5-VL-3B-Instruct

Model Quantization

This guide uses Q5_K_M quantization.

O6 / O6N

./build/bin/llama-quantize ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-F16.gguf ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf Q5_K_M
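
As a rough sanity check on the quantized file, its size can be estimated as parameter count times average bits per weight (BPW), divided by 8. The sketch below uses the figures from the load log in the Model Test section (3.09 B parameters, 5.75 BPW). The function name estimate_gib is illustrative, and the result is an approximation.

```shell
# Hypothetical helper: estimate file size in GiB.
#   $1 = parameter count in billions, $2 = average bits per weight
estimate_gib() {
    awk -v params_b="$1" -v bpw="$2" \
        'BEGIN { printf "%.2f\n", params_b * 1e9 * bpw / 8 / 1073741824 }'
}

# Values reported by llama.cpp for the Q5_K_M file in this guide.
estimate_gib 3.09 5.75
```

This prints 2.07, which agrees with the 2.07 GiB file size in the log. A quantized file that is much smaller than the estimate suggests the quantization run was interrupted.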

Model Test

Test input image

O6 / O6N

./build/bin/llama-mtmd-cli -m ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --mmproj ./Qwen2.5-VL-3B-Instruct/mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf -p "Describe this image." --image ./test.png

Model output:

$ ./build/bin/llama-mtmd-cli -m ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --mmproj ./Qwen2.5-VL-3B-Instruct/mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf -p "Describe this image." --image ./Qwen2.5-VL-3B-Instruct/test.png
build: 7110 (3ae282a06) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 27 key-value pairs and 434 tensors from ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 VL 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-VL
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                        qwen2vl.block_count u32              = 36
llama_model_loader: - kv   7:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv   8:                   qwen2vl.embedding_length u32              = 2048
llama_model_loader: - kv   9:                qwen2vl.feed_forward_length u32              = 11008
llama_model_loader: - kv  10:               qwen2vl.attention.head_count u32              = 16
llama_model_loader: - kv  11:            qwen2vl.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 17
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q5_K:  216 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 2.07 GiB (5.75 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [16, 24, 24, 0]
print_info: model type       = 3B
print_info: model params     = 3.09 B
print_info: general.name     = Qwen2.5 VL 3B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  2116.07 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   144.00 MiB
llama_kv_cache: size =  144.00 MiB (  4096 cells,  36 layers,  1/1 seqs), K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =   304.75 MiB
llama_context: graph nodes  = 1231
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:   Qwen2.5 VL 3B Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    519
clip_model_loader: n_kv:         22

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector:          qwen2.5vl_merger
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               3420
load_hparams: n_layer:            32
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     2048

--- vision hparams ---
load_hparams: image_size:         560
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern:       8
load_hparams: image_min_pixels:   6272
load_hparams: image_max_pixels:   3211264

load_hparams: model size:         1276.39 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta: warmup with image size = 1288 x 1288
alloc_compute_meta:        CPU compute buffer size =   732.56 MiB
alloc_compute_meta: graph splits = 1, nodes = 1092
warmup: flash attention is enabled
main: loading model: ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf
encoding image slice...
image slice encoded in 8425 ms
decoding image batch 1/1, n_tokens_batch = 361
image decoded (batch 1/1) in 13109 ms

The image depicts a single, delicate rose with a soft pink hue, resting on a dark, possibly marble, surface. The rose is positioned near a window, which has a dark frame. The window appears to be letting in some light, creating a contrast between the illuminated rose and the darker surroundings. The overall scene has a serene and somewhat melancholic atmosphere, with the rose being the central focus.


llama_perf_context_print:        load time =     497.68 ms
llama_perf_context_print: prompt eval time =   22189.23 ms /   375 tokens (   59.17 ms per token,    16.90 tokens per second)
llama_perf_context_print:        eval time =    9434.97 ms /    80 runs   (  117.94 ms per token,     8.48 tokens per second)
llama_perf_context_print:       total time =   31913.30 ms /   455 tokens
llama_perf_context_print:    graphs reused =          0
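
Several figures in the log above can be reproduced by hand, which is useful when estimating the effect of a different context size. The sketch below assumes each side of the KV cache holds n_ctx × n_layer × n_embd_k_gqa elements at 2 bytes each (f16), and that throughput is tokens divided by elapsed seconds. Both function names are illustrative.

```shell
# Size in MiB of one side (K or V) of the KV cache.
#   $1 = n_ctx, $2 = n_layer, $3 = n_embd_k_gqa (or n_embd_v_gqa), $4 = bytes per element
kv_side_mib() {
    echo $(( $1 * $2 * $3 * $4 / 1024 / 1024 ))
}

# Tokens per second from elapsed milliseconds and token count.
#   $1 = elapsed time in ms, $2 = number of tokens
tokens_per_second() {
    awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

# Values from the log: n_ctx = 4096, 36 layers, n_embd_k_gqa = n_embd_v_gqa = 256, f16 cache.
k_mib=$(kv_side_mib 4096 36 256 2)
v_mib=$(kv_side_mib 4096 36 256 2)
echo "K: ${k_mib} MiB, V: ${v_mib} MiB, total: $(( k_mib + v_mib )) MiB"

tokens_per_second 22189.23 375   # prompt eval
tokens_per_second 9434.97 80     # generation
```

The output matches the log: 72 MiB each for K and V (144 MiB in total), 16.90 tokens per second for prompt evaluation, and 8.48 tokens per second for generation. Image encoding (8425 ms) and image decoding (13109 ms) together appear to account for most of the 22189 ms prompt evaluation time.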
