Qwen2.5-VL
Qwen2.5-VL is a family of multimodal vision-language models (VLMs) developed by the Qwen team at Alibaba Cloud. Compared with the previous generation, it improves long-video understanding, long-document parsing, and logical reasoning in complex scenes, with the aim of providing more general and practical visual interaction.
- Key features: strong visual perception and alignment, with support for high-resolution images and video inputs longer than one hour. A major highlight is the improved "Visual Agent" capability, which enables accurate coordinate grounding, UI interaction, and structured data extraction for automation workflows, multimodal search, and high-accuracy visual Q&A.
- Model variant: Qwen2.5-VL-3B-Instruct is a compact (~3B parameters) instruction-tuned model. It balances capability against compute cost, which makes it suitable for edge devices, real-time interactive applications, and low-resource development environments.
Environment Setup
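Follow the llama.cpp documentation to build llama.cpp first. The commands below are run from the llama.cpp directory and expect the binaries in ./build/bin.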
Quick Start
Download the Model
O6 / O6N
pip3 install modelscope
cd llama.cpp
modelscope download --model radxa/Qwen2.5-VL-3B-Instruct-NOE mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf --local_dir ./
modelscope download --model radxa/Qwen2.5-VL-3B-Instruct-NOE Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --local_dir ./
modelscope download --model radxa/Qwen2.5-VL-3B-Instruct-NOE test.png --local_dir ./
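Before running the model, it can help to confirm that all three files arrived. The following is a minimal sketch, not part of the original workflow: the helper name `missing_files` is hypothetical, the file names are copied from the download commands above, and treating a zero-byte file as missing is an assumption about how an interrupted download might look.

```python
from pathlib import Path

# File names copied from the download commands above.
EXPECTED_FILES = (
    "Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf",
    "mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf",
    "test.png",
)


def missing_files(directory=".", names=EXPECTED_FILES):
    """Return the expected names that are absent or empty in `directory`."""
    base = Path(directory)
    missing = []
    for name in names:
        path = base / name
        # Assumption: an empty file is treated the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    gone = missing_files(".")
    print("all files present" if not gone else f"missing: {gone}")
```

Run it from the llama.cpp directory, since the download commands use `--local_dir ./`.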
Run the Model
O6 / O6N
./build/bin/llama-mtmd-cli -m ./Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --mmproj ./mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf -p "Describe this image." --image ./test.png
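If the same command needs to be driven from a script, one possible approach is a thin wrapper around the CLI. This sketch uses only the flags that appear in the command above (`-m`, `--mmproj`, `-p`, `--image`). The function names are hypothetical, and it assumes that the generated text is written to standard output while diagnostics go to standard error, which should be checked against your build.

```python
import subprocess


def build_mtmd_command(model, mmproj, image, prompt,
                       binary="./build/bin/llama-mtmd-cli"):
    """Assemble the argument list using only the flags shown in this guide."""
    return [binary, "-m", model, "--mmproj", mmproj, "-p", prompt, "--image", image]


def describe_image(model, mmproj, image, prompt="Describe this image."):
    """Run the CLI once and return whatever it wrote to standard output."""
    cmd = build_mtmd_command(model, mmproj, image, prompt)
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Assumption: the answer is on stdout and the loader logs are on stderr.
    return result.stdout
```

For example, `describe_image("./Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf", "./mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf", "./test.png")` mirrors the command above. Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.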
Full Conversion Workflow
Clone the Model Repository
O6 / O6N
cd llama.cpp
git clone https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
Create a Virtual Environment
O6 / O6N
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
Model Conversion
Convert the Text Module
O6 / O6N
python3 ./convert_hf_to_gguf.py ./Qwen2.5-VL-3B-Instruct
Convert the Vision Module
O6 / O6N
python3 ./convert_hf_to_gguf.py --mmproj ./Qwen2.5-VL-3B-Instruct
Model Quantization
This guide uses Q5_K_M quantization.
O6 / O6N
./build/bin/llama-quantize ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-F16.gguf ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf Q5_K_M
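As a rough check on what quantization buys, the file size can be estimated from the parameter count and the average bits per weight. The sketch below takes 3.09 B parameters and 5.75 bits per weight (BPW) from the log in the Model Test section of this guide. The 16 bits per weight used for F16 is an approximation, because some tensors are stored as f32, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def estimated_size_gib(n_params, bits_per_weight):
    """Approximate file size in GiB, ignoring metadata and alignment padding."""
    return n_params * bits_per_weight / 8 / GIB


# Values reported by the log in this guide: 3.09 B params, 5.75 BPW for Q5_K_M.
q5_k_m = estimated_size_gib(3.09e9, 5.75)
# Approximation: treat every F16 weight as exactly 16 bits.
f16 = estimated_size_gib(3.09e9, 16)

print(f"Q5_K_M: about {q5_k_m:.2f} GiB")
print(f"F16:    about {f16:.2f} GiB")
```

The Q5_K_M estimate comes out at about 2.07 GiB, which agrees with the file size printed in the log. By the same estimate the F16 file is a little under three times larger.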
Model Test

Test input image
O6 / O6N
./build/bin/llama-mtmd-cli -m ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --mmproj ./Qwen2.5-VL-3B-Instruct/mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf -p "Describe this image." --image ./test.png
Model output (the recorded run below read test.png from ./Qwen2.5-VL-3B-Instruct/ rather than ./):
$ ./build/bin/llama-mtmd-cli -m ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf --mmproj ./Qwen2.5-VL-3B-Instruct/mmproj-Qwen2.5-VL-3b-Instruct-F16.gguf -p "Describe this image." --image ./Qwen2.5-VL-3B-Instruct/test.png
build: 7110 (3ae282a06) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 27 key-value pairs and 434 tensors from ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 VL 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-VL
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: qwen2vl.block_count u32 = 36
llama_model_loader: - kv 7: qwen2vl.context_length u32 = 128000
llama_model_loader: - kv 8: qwen2vl.embedding_length u32 = 2048
llama_model_loader: - kv 9: qwen2vl.feed_forward_length u32 = 11008
llama_model_loader: - kv 10: qwen2vl.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen2vl.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% set image_count = namespace(value=...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 17
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q5_K: 216 tensors
llama_model_loader: - type q6_K: 37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 2.07 GiB (5.75 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 36
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: mrope sections = [16, 24, 24, 0]
print_info: model type = 3B
print_info: model params = 3.09 B
print_info: general.name = Qwen2.5 VL 3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 2116.07 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 144.00 MiB
llama_kv_cache: size = 144.00 MiB ( 4096 cells, 36 layers, 1/1 seqs), K (f16): 72.00 MiB, V (f16): 72.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 304.75 MiB
llama_context: graph nodes = 1231
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name: Qwen2.5 VL 3B Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 519
clip_model_loader: n_kv: 22
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
load_hparams: projector: qwen2.5vl_merger
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 3420
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 560
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 8
load_hparams: image_min_pixels: 6272
load_hparams: image_max_pixels: 3211264
load_hparams: model size: 1276.39 MiB
load_hparams: metadata size: 0.18 MiB
alloc_compute_meta: warmup with image size = 1288 x 1288
alloc_compute_meta: CPU compute buffer size = 732.56 MiB
alloc_compute_meta: graph splits = 1, nodes = 1092
warmup: flash attention is enabled
main: loading model: ./Qwen2.5-VL-3B-Instruct/Qwen2.5-VL-3B-Instruct-Q5_K_M.gguf
encoding image slice...
image slice encoded in 8425 ms
decoding image batch 1/1, n_tokens_batch = 361
image decoded (batch 1/1) in 13109 ms
The image depicts a single, delicate rose with a soft pink hue, resting on a dark, possibly marble, surface. The rose is positioned near a window, which has a dark frame. The window appears to be letting in some light, creating a contrast between the illuminated rose and the darker surroundings. The overall scene has a serene and somewhat melancholic atmosphere, with the rose being the central focus.
llama_perf_context_print: load time = 497.68 ms
llama_perf_context_print: prompt eval time = 22189.23 ms / 375 tokens ( 59.17 ms per token, 16.90 tokens per second)
llama_perf_context_print: eval time = 9434.97 ms / 80 runs ( 117.94 ms per token, 8.48 tokens per second)
llama_perf_context_print: total time = 31913.30 ms / 455 tokens
llama_perf_context_print: graphs reused = 0
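Several figures in the log above can be reproduced with simple arithmetic, which is a useful sanity check when comparing runs. The sketch below is illustrative only. The function names are hypothetical, all input values are copied from the log, and the image-token formula assumes that each image token covers a square of (patch_size × n_merge) pixels per side, which is consistent with the image_min_pixels and image_max_pixels values shown.

```python
MIB = 1024 ** 2


def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Combined size of the K and V caches in MiB (f16 is 2 bytes per element)."""
    one_side = n_ctx * n_layer * n_embd_kv * bytes_per_elem
    return 2 * one_side / MIB


def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput from a token count and an elapsed time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def image_tokens(pixels, patch_size=14, n_merge=2):
    """Approximate token count for an image area, under the stated assumption."""
    return pixels // (patch_size * n_merge) ** 2


# n_ctx = 4096, n_layer = 36, n_embd_k_gqa = n_embd_v_gqa = 256 (from the log).
print(kv_cache_mib(4096, 36, 256))                # 144.0
# Prompt eval: 375 tokens in 22189.23 ms; generation: 80 tokens in 9434.97 ms.
print(round(tokens_per_second(375, 22189.23), 2))  # 16.9
print(round(tokens_per_second(80, 9434.97), 2))    # 8.48
# image_min_pixels = 6272 and image_max_pixels = 3211264 (from the log).
print(image_tokens(6272), image_tokens(3211264))   # 8 4096
```

Under that assumption, the default lower bound of 8 image tokens is far below the 1024 tokens that the load_hparams warning recommends for grounding tasks, which is why the log suggests `--image-min-tokens 1024` if accuracy is a problem. In this run the image was encoded as 361 tokens out of the 375 prompt tokens.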