MiniCPM-V 2.6

MiniCPM-V is a family of multimodal vision-language models (VLMs) developed by ModelBest and the NLP lab at Tsinghua University. The family targets edge devices: its models take images together with text instructions and cover use cases such as image understanding, multi-turn conversation, and video analysis.

Key features: The model accepts images with different aspect ratios and supports video understanding (summarization and Q&A). It offers improved pixel-level spatial perception for coordinate grounding and object tracking. It is optimized for edge deployment and can handle complex tables, long images, and OCR-style text extraction with reduced memory usage.
Model variant: MiniCPM-V 2.6 is the variant used in this guide, with roughly 8B parameters. It supports single-image, multi-image, and short-video understanding and suits mobile and edge deployments where latency and compute cost matter.

Environment Setup

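Follow the llama.cpp document to build llama.cpp first. The commands in this guide are run from the llama.cpp source directory and expect the binaries under ./build/bin.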

Quick Start

Download the Model

O6 / O6N

pip3 install modelscope
cd llama.cpp
modelscope download --model radxa/minicpm-v-2_6-gguf ggml-model-Q5_K_M.gguf --local_dir ./
modelscope download --model radxa/minicpm-v-2_6-gguf mmproj-model-f16.gguf --local_dir ./
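Before running the model, it can be worth confirming that both downloads completed. The sketch below checks that each file exists, is non-empty, and begins with the 4-byte GGUF magic. The helper name is_gguf is made up for this example and is not part of llama.cpp or modelscope.

```shell
# Hypothetical helper: succeeds only if the file exists, is non-empty,
# and starts with the ASCII magic "GGUF".
is_gguf() (
    [ -s "$1" ] || exit 1
    [ "$(head -c 4 "$1")" = "GGUF" ]
)

# Check the two files downloaded above (paths as used in this guide).
for f in ./ggml-model-Q5_K_M.gguf ./mmproj-model-f16.gguf; do
    if is_gguf "$f"; then
        echo "ok: $f"
    else
        echo "missing or not GGUF: $f" >&2
    fi
done
```

This only catches missing, empty, or obviously wrong files; it does not verify checksums.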

Run the Model

O6 / O6N

./build/bin/llama-mtmd-cli -m ./ggml-model-Q5_K_M.gguf --mmproj ./mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg
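To try other images or prompts without retyping the full command, the invocation above can be wrapped in a small function. This is only a sketch: the function name describe_image and the variables LLAMA_MTMD_CLI, MODEL, and MMPROJ are introduced here for illustration, and the flags passed are exactly those used in the command above.

```shell
# Hypothetical wrapper around the command above.
# Usage: describe_image <image> [prompt]
describe_image() (
    cli="${LLAMA_MTMD_CLI:-./build/bin/llama-mtmd-cli}"
    model="${MODEL:-./ggml-model-Q5_K_M.gguf}"
    mmproj="${MMPROJ:-./mmproj-model-f16.gguf}"
    image="$1"
    prompt="${2:-What is this picture about?}"

    # Fail early with a clear message instead of starting a slow model load.
    if [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        exit 1
    fi

    "$cli" -m "$model" --mmproj "$mmproj" -p "$prompt" --image "$image"
)

# Example (run from the llama.cpp directory):
#   describe_image ./tools/mtmd/test-1.jpeg
```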

Full Conversion Workflow

Clone the Model Repository

O6 / O6N

cd llama.cpp
hf download openbmb/MiniCPM-V-2_6 --local-dir ./MiniCPM-V-2_6

Create a Virtual Environment

O6 / O6N

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
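Because pip3 installs into whichever Python environment is active, it can help to confirm that the virtual environment is really active before installing the requirements. A minimal sketch, relying on the VIRTUAL_ENV variable that the venv activate script sets; the function name in_venv is made up for this example.

```shell
# Hypothetical helper: succeeds if a virtual environment is active and
# contains an executable python3.
in_venv() (
    [ -n "${VIRTUAL_ENV:-}" ] && [ -x "$VIRTUAL_ENV/bin/python3" ]
)

if in_venv; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment active; run: source .venv/bin/activate" >&2
fi
```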

Model Conversion

Split the Model

O6 / O6N

python3 ./tools/mtmd/legacy-models/minicpmv-surgery.py -m ./MiniCPM-V-2_6
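The following steps in this guide read ./MiniCPM-V-2_6/minicpmv.projector and ./MiniCPM-V-2_6/model, so the split step is expected to have produced both. The sketch below checks for them before continuing. The function name check_surgery_outputs is made up for illustration, and the list of expected outputs is inferred only from the paths used later in this guide.

```shell
# Hypothetical helper: verify the outputs that the later conversion
# commands in this guide depend on.
# Usage: check_surgery_outputs ./MiniCPM-V-2_6
check_surgery_outputs() (
    dir="$1"
    status=0
    if [ ! -f "$dir/minicpmv.projector" ]; then
        echo "missing: $dir/minicpmv.projector" >&2
        status=1
    fi
    if [ ! -d "$dir/model" ]; then
        echo "missing: $dir/model" >&2
        status=1
    fi
    exit "$status"
)
```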

Convert the Vision Module

O6 / O6N

python3 ./tools/mtmd/legacy-models/minicpmv-convert-image-encoder-to-gguf.py -m ./MiniCPM-V-2_6 --minicpmv-projector ./MiniCPM-V-2_6/minicpmv.projector --output-dir ./MiniCPM-V-2_6/ --minicpmv_version 3

Convert the Text Module

O6 / O6N

python3 ./convert_hf_to_gguf.py ./MiniCPM-V-2_6/model

Model Quantization

This guide uses Q5_K_M quantization.

O6 / O6N

./build/bin/llama-quantize ./MiniCPM-V-2_6/model/Model-7.6B-F16.gguf ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf Q5_K_M

Model Test

Test input image: ./tools/mtmd/test-1.jpeg, the sample image passed to the command below.

O6 / O6N

./build/bin/llama-mtmd-cli -m ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf --mmproj ./MiniCPM-V-2_6/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg

Model output:

$ ./build/bin/llama-mtmd-cli -m ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf --mmproj ./MiniCPM-V-2_6/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg
build: 7110 (3ae282a06) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 24 key-value pairs and 339 tensors from ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 7.6B
llama_model_loader: - kv   4:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   5:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   6:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   7:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   8:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   9:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  10:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,151666]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,151666]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 128244
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - kv  23:                          general.file_type u32              = 17
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 5.06 GiB (5.71 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load: special tokens cache size = 24
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 3584
print_info: n_embd_inp       = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 7.61 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 151666
print_info: n_merges         = 151387
print_info: BOS token        = 151644 '<|im_start|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: UNK token        = 128244 '<unk>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  5184.87 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache:        CPU KV buffer size =   224.00 MiB
llama_kv_cache: size =  224.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =   303.22 MiB
llama_context: graph nodes  = 959
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:
clip_model_loader: description:  image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    455
clip_model_loader: n_kv:         20

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector:          resampler
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     0

--- vision hparams ---
load_hparams: image_size:         448
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   3
load_hparams: n_merge:            0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         996.02 MiB
load_hparams: metadata size:      0.16 MiB
load_tensors: ffn up/down are swapped
alloc_compute_meta: warmup with image size = 448 x 448
alloc_compute_meta:        CPU compute buffer size =    55.81 MiB
alloc_compute_meta: graph splits = 1, nodes = 893
warmup: flash attention is enabled
main: loading model: ./MiniCPM-V-2_6/model/ggml-model-Q5_K_M.gguf
encoding image slice...
image slice encoded in 5523 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 5046 ms
encoding image slice...
image slice encoded in 5550 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 5063 ms
encoding image slice...
image slice encoded in 5540 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 5083 ms

The image is a black and white photograph of a newspaper, specifically the front page of The New York Times from July 21, 1969. The headline, written in bold, large font, proclaims "Men Walk on Moon". Below the headline, there's a subheading that reads "Astronauts Land on Plain; Collect Rocks, Plant Flag". The newspaper is open to a page with a photograph of a man walking on the moon. The man is wearing a spacesuit and is seen walking on the lunar surface. The photograph is accompanied by a caption that reads "A Powdery Surface Is Closely Explored". The newspaper is a significant historical artifact, marking the momentous event of the first human landing on the moon.


llama_perf_context_print:        load time =     936.29 ms
llama_perf_context_print: prompt eval time =   33699.48 ms /   212 tokens (  158.96 ms per token,     6.29 tokens per second)
llama_perf_context_print:        eval time =   27574.13 ms /   156 runs   (  176.76 ms per token,     5.66 tokens per second)
llama_perf_context_print:       total time =   61703.52 ms /   368 tokens
llama_perf_context_print:    graphs reused =        154
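Several figures in this log can be cross-checked with a little arithmetic. The sketch below is not part of llama.cpp; the function names are made up for illustration, and the formulas are simple estimates: throughput as tokens divided by seconds, file size as parameters times bits per weight, and K-cache size as cells times layers times K width times bytes per element.

```shell
# Illustrative helpers (hypothetical names); awk does the floating-point math.

# tokens_per_second <elapsed_ms> <tokens>
tokens_per_second() (
    awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
)

# model_size_gib <params_in_billions> <bits_per_weight>
model_size_gib() (
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.2f\n", p * 1e9 * bpw / 8 / 1073741824 }'
)

# kv_half_mib <n_ctx> <n_layer> <n_embd_k_gqa> <bytes_per_element>
# Size of the K (or V) half of the cache in MiB.
kv_half_mib() (
    awk -v c="$1" -v l="$2" -v w="$3" -v b="$4" 'BEGIN { printf "%.2f\n", c * l * w * b / 1048576 }'
)

tokens_per_second 33699.48 212   # prompt eval line
tokens_per_second 27574.13 156   # eval (generation) line
model_size_gib 7.61 5.71         # model params and BPW from print_info
kv_half_mib 4096 28 512 2        # n_ctx, n_layer, n_embd_k_gqa, f16 = 2 bytes
```

With the values taken from the log, these print 6.29, 5.66, 5.06, and 112.00, which agree with the reported tokens per second, file size in GiB, and K buffer size in MiB. The V half is the same size, giving the 224.00 MiB total. Since the parameter count and BPW in the log are rounded, the size estimate is approximate.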
