MiniCPM-V 2.6
MiniCPM-V is a family of multimodal vision-language models (VLMs) developed by ModelBest and the NLP lab at Tsinghua University. The models are designed to run on edge devices: they take images and text instructions as input and cover use cases such as image understanding, multi-turn conversation, and video analysis.
- Key features: handles images of varying aspect ratios and supports video understanding (summarization and Q&A). Improved pixel-level spatial perception enables coordinate grounding and object tracking. The model is optimized for edge deployment and can process complex tables, long images, and OCR-style text extraction with reduced memory usage.
- Model variant: MiniCPM-V 2.6, the variant used in this guide, has roughly 8B parameters. It supports single-image, multi-image, and short-video understanding and suits mobile/edge deployments where latency and compute cost matter.
Environment Setup
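Follow the llama.cpp documentation to set up and build llama.cpp first. The commands below are run from the llama.cpp directory and expect the binaries in ./build/bin.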
Quick Start
Download the Model
O6 / O6N
pip3 install modelscope
cd llama.cpp
modelscope download --model radxa/minicpm-v-2_6-gguf ggml-model-Q5_K_M.gguf --local_dir ./
modelscope download --model radxa/minicpm-v-2_6-gguf mmproj-model-f16.gguf --local_dir ./
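Before running the model, it can help to confirm that both downloads completed. The sketch below is not part of the original workflow: it only checks that each file starts with the 4-byte GGUF magic and reads the format version that follows it. The helper name `gguf_version` is hypothetical, and the check does not validate the rest of the file.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # every GGUF file starts with these four bytes


def gguf_version(path):
    """Return the GGUF format version stored in the file header.

    Raises ValueError if the file is too short or lacks the GGUF magic.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The version is a little-endian uint32 stored right after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # File names as downloaded by the commands above.
    for name in ("ggml-model-Q5_K_M.gguf", "mmproj-model-f16.gguf"):
        p = Path(name)
        if p.is_file():
            size_gib = p.stat().st_size / 2**30
            print(f"{name}: GGUF v{gguf_version(p)}, {size_gib:.2f} GiB")
        else:
            print(f"{name}: missing")
```

Run it from the llama.cpp directory. A file reported as missing, or one that raises ValueError, most likely means the download was interrupted.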
Run the Model
O6 / O6N
./build/bin/llama-mtmd-cli -m ./ggml-model-Q5_K_M.gguf --mmproj ./mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg
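To ask the same question about several images, the command above can be assembled programmatically. This is a minimal sketch, assuming only the flags already shown (`-m`, `--mmproj`, `-p`, `--image`); the helper name `build_mtmd_command` and the image list are hypothetical.

```python
import shlex


def build_mtmd_command(model, mmproj, image, prompt,
                       binary="./build/bin/llama-mtmd-cli"):
    """Return the argument list for one llama-mtmd-cli run.

    Only the flags used in this guide are emitted.
    """
    return [binary, "-m", model, "--mmproj", mmproj,
            "-p", prompt, "--image", image]


if __name__ == "__main__":
    # Hypothetical image list; replace with your own files.
    images = ["./tools/mtmd/test-1.jpeg"]
    for image in images:
        cmd = build_mtmd_command(
            "./ggml-model-Q5_K_M.gguf",
            "./mmproj-model-f16.gguf",
            image,
            "What is this picture about?",
        )
        # Print a copy-pasteable command line; to execute it instead,
        # pass the list to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids shell quoting problems when a prompt contains spaces or quotes.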
Full Conversion Workflow
Download the Model Repository
O6 / O6N
cd llama.cpp
hf download openbmb/MiniCPM-V-2_6 --local-dir ./MiniCPM-V-2_6
Create a Virtual Environment
O6 / O6N
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
Model Conversion
Split the model
O6 / O6N
python3 ./tools/mtmd/legacy-models/minicpmv-surgery.py -m ./MiniCPM-V-2_6
Convert the Vision Module
O6 / O6N
python3 ./tools/mtmd/legacy-models/minicpmv-convert-image-encoder-to-gguf.py -m ./MiniCPM-V-2_6 --minicpmv-projector ./MiniCPM-V-2_6/minicpmv.projector --output-dir ./MiniCPM-V-2_6/ --minicpmv_version 3
Convert the Text Module
O6 / O6N
python3 ./convert_hf_to_gguf.py ./MiniCPM-V-2_6/model
Model Quantization
This guide quantizes the converted F16 model to Q5_K_M.
O6 / O6N
./build/bin/llama-quantize ./MiniCPM-V-2_6/model/Model-7.6B-F16.gguf ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf Q5_K_M
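Each step above leaves files behind that the next step reads, so a failed step usually shows up as a missing file. The sketch below lists what is absent. The paths are inferred from the commands in this guide (for example, `mmproj-model-f16.gguf` is assumed to be written to the `--output-dir` of the vision conversion), and the helper name `missing_artifacts` is hypothetical.

```python
from pathlib import Path

# (step, path relative to the model directory), inferred from the
# commands in this guide rather than from the conversion scripts.
EXPECTED = [
    ("split", "minicpmv.projector"),
    ("split", "model"),
    ("vision conversion", "mmproj-model-f16.gguf"),
    ("text conversion", "model/Model-7.6B-F16.gguf"),
    ("quantization", "model/ggml-model-Q5_K_M.gguf"),
]


def missing_artifacts(model_dir):
    """Return the (step, relative path) pairs that do not exist yet."""
    root = Path(model_dir)
    return [(step, rel) for step, rel in EXPECTED
            if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifacts("./MiniCPM-V-2_6")
    if not missing:
        print("all expected files are present")
    for step, rel in missing:
        print(f"missing after {step}: {rel}")
```

The first missing entry points at the step to rerun.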
Model Test

Test input image
O6 / O6N
./build/bin/llama-mtmd-cli -m ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf --mmproj ./MiniCPM-V-2_6/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg
Model output:
$ ./build/bin/llama-mtmd-cli -m ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf --mmproj ./MiniCPM-V-2_6/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg
build: 7110 (3ae282a06) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 24 key-value pairs and 339 tensors from ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 7.6B
llama_model_loader: - kv 4: qwen2.block_count u32 = 28
llama_model_loader: - kv 5: qwen2.context_length u32 = 32768
llama_model_loader: - kv 6: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 7: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 8: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 9: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 10: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,151666] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,151666] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 151644
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 128244
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: general.file_type u32 = 17
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 5.06 GiB (5.71 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: special tokens cache size = 24
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 3584
print_info: n_embd_inp = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = 7B
print_info: model params = 7.61 B
print_info: general.name = Model
print_info: vocab type = BPE
print_info: n_vocab = 151666
print_info: n_merges = 151387
print_info: BOS token = 151644 '<|im_start|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: UNK token = 128244 '<unk>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 5184.87 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 224.00 MiB
llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 303.22 MiB
llama_context: graph nodes = 959
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name:
clip_model_loader: description: image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 455
clip_model_loader: n_kv: 20
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector: resampler
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 0
--- vision hparams ---
load_hparams: image_size: 448
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 3
load_hparams: n_merge: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 996.02 MiB
load_hparams: metadata size: 0.16 MiB
load_tensors: ffn up/down are swapped
alloc_compute_meta: warmup with image size = 448 x 448
alloc_compute_meta: CPU compute buffer size = 55.81 MiB
alloc_compute_meta: graph splits = 1, nodes = 893
warmup: flash attention is enabled
main: loading model: ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf
encoding image slice...
image slice encoded in 5523 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 5046 ms
encoding image slice...
image slice encoded in 5550 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 5063 ms
encoding image slice...
image slice encoded in 5540 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 5083 ms
The image is a black and white photograph of a newspaper, specifically the front page of The New York Times from July 21, 1969. The headline, written in bold, large font, proclaims "Men Walk on Moon". Below the headline, there's a subheading that reads "Astronauts Land on Plain; Collect Rocks, Plant Flag". The newspaper is open to a page with a photograph of a man walking on the moon. The man is wearing a spacesuit and is seen walking on the lunar surface. The photograph is accompanied by a caption that reads "A Powdery Surface Is Closely Explored". The newspaper is a significant historical artifact, marking the momentous event of the first human landing on the moon.
llama_perf_context_print: load time = 936.29 ms
llama_perf_context_print: prompt eval time = 33699.48 ms / 212 tokens ( 158.96 ms per token, 6.29 tokens per second)
llama_perf_context_print: eval time = 27574.13 ms / 156 runs ( 176.76 ms per token, 5.66 tokens per second)
llama_perf_context_print: total time = 61703.52 ms / 368 tokens
llama_perf_context_print: graphs reused = 154
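Several figures in the log above can be reproduced by hand, which is a useful sanity check when comparing runs. The sketch below recomputes the bits per weight, the KV cache size, and the throughput from values printed in the log. The function names are hypothetical, and because the log itself prints rounded inputs (5.06 GiB, 7.61 B), the results only match to the printed precision.

```python
GIB = 2**30
MIB = 2**20


def bits_per_weight(file_size_gib, params_billion):
    """Average number of bits stored per model parameter."""
    return file_size_gib * GIB * 8 / (params_billion * 1e9)


def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Combined size of the K and V caches in MiB.

    Assumes K and V have the same width (n_embd_k_gqa == n_embd_v_gqa,
    both 512 in the log) and f16 storage (2 bytes per element).
    """
    one_side = n_ctx * n_layer * n_embd_kv * bytes_per_elem
    return 2 * one_side / MIB


def tokens_per_second(n_tokens, elapsed_ms):
    """Throughput from a token count and an elapsed time in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    # All inputs are copied from the log above.
    print(f"bits per weight: {bits_per_weight(5.06, 7.61):.2f}")
    print(f"KV cache:        {kv_cache_mib(4096, 28, 512):.2f} MiB")
    print(f"prompt eval:     {tokens_per_second(212, 33699.48):.2f} tokens/s")
    print(f"generation:      {tokens_per_second(156, 27574.13):.2f} tokens/s")
    # Three image slices of 64 tokens each account for most of the prompt.
    print(f"image tokens:    {3 * 64} of 212 prompt tokens")
```

The results agree with the log: 5.71 BPW, a 224.00 MiB KV cache, and 6.29 and 5.66 tokens per second. The last line shows that 192 of the 212 prompt tokens come from the three image slices, so prompt evaluation time here is dominated by image encoding and decoding rather than by the text prompt.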