LLaVA 1.6

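LLaVA (Large Language and Vision Assistant) is a family of multimodal vision-language models developed by researchers at the University of Wisconsin-Madison, Microsoft Research, Columbia University, and other institutions. It connects a pretrained vision encoder to a large language model (LLM) and trains the two end to end, so the model can process images and text together. Supported tasks include image captioning, visual question answering, and multimodal dialogue.

Key features: The model uses a dynamic-resolution scheme that adapts the input resolution to the image and splits larger images into tiles, which improves recognition of small objects, complex tables, and dense text (OCR). A projection layer maps visual features into the language model's embedding space (the log at the end of this page reports it as "projector: mlp"), so the language model can consume image features directly and carry out cross-modal reasoning from natural-language instructions.
Version notes: The LLaVA 1.6 Vicuna 7B model demonstrated here is one member of the family. It is built on the Vicuna 7B language model and has about 7 billion parameters. Compared with version 1.5, it was trained on more data and uses an improved visual representation, so it accepts higher-resolution images while aiming to keep inference speed acceptable. It suits research experiments, deployment on small and medium servers, and visual interaction applications.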
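
As a rough illustration of why high-resolution input costs more, the number of visual tokens can be estimated from the encoder settings reported in the test log at the end of this page (image_size 336, patch_size 14). The tile count of 5 used below (one downscaled overview plus a 2 x 2 grid) is an assumption, chosen because it reproduces the 2048 + 832 image tokens decoded in that log; other images may be split differently. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: estimate the visual token count for a tiled image.

# tokens_per_tile IMAGE_SIZE PATCH_SIZE
# Each tile is cut into (IMAGE_SIZE / PATCH_SIZE)^2 patches, one token per patch.
tokens_per_tile() {
    side=$(( $1 / $2 ))
    echo $(( side * side ))
}

# image_tokens IMAGE_SIZE PATCH_SIZE TILES
# TILES is an assumed tile count, not something read from the model.
image_tokens() {
    per_tile=$(tokens_per_tile "$1" "$2")
    echo $(( per_tile * $3 ))
}

tokens_per_tile 336 14     # 576
image_tokens 336 14 5      # 2880 = 2048 + 832
```

Under these assumptions a single image already fills most of the 4096-token context, which is consistent with the long prompt evaluation time in the log.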

Environment Setup

Prepare the llama.cpp tools by following the llama.cpp documentation.

Quick Start

Download the Model

O6 / O6N

pip3 install modelscope
cd llama.cpp
modelscope download --model radxa/llava-v1.6-vicuna-7b-gguf llava-v1.6-vicuna-7B-Q5_K_M.gguf --local_dir ./
modelscope download --model radxa/llava-v1.6-vicuna-7b-gguf mmproj-model-f16.gguf --local_dir ./
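
Before running the model, it may be worth confirming that both downloads are actually GGUF files. The sketch below relies on the fact that a GGUF file starts with the four ASCII bytes "GGUF"; the helper name is_gguf is hypothetical. It only inspects the header, so it catches a missing file or an error page saved in place of the model, not a truncated download.

```shell
#!/bin/sh
# is_gguf FILE: succeed if FILE exists and begins with the 4-byte magic "GGUF".
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Check the two files fetched by the download commands above.
for f in ./llava-v1.6-vicuna-7B-Q5_K_M.gguf ./mmproj-model-f16.gguf; do
    if is_gguf "$f"; then
        echo "ok: $f"
    else
        echo "missing or not a GGUF file: $f"
    fi
done
```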

Run the Model

O6 / O6N

./build/bin/llama-mtmd-cli -m ./llava-v1.6-vicuna-7B-Q5_K_M.gguf --mmproj ./mmproj-model-f16.gguf -p "Describe this image." --image ./tools/mtmd/test-1.jpeg

Full Conversion Workflow

Clone the Model Repository

O6 / O6N

cd llama.cpp
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b

Create a Virtual Environment

O6 / O6N

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
pip3 install -r tools/mtmd/requirements.txt

Model Conversion

Split the Model

O6 / O6N

python3 ./tools/mtmd/legacy-models/llava_surgery_v2.py -C -m ./llava-v1.6-vicuna-7b/

When the script finishes, the model directory contains the files llava.projector and llava.clip.

Create the vit Directory

O6 / O6N

cd ./llava-v1.6-vicuna-7b/
mkdir vit
cp ./llava.clip vit/pytorch_model.bin
cp ./llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
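
Because curl runs with -s and without -f here, a failed request prints nothing and may leave an error page saved as config.json. A minimal sketch of a sanity check for the vit directory follows; the function name check_vit_dir is hypothetical, and the JSON test is deliberately crude (it only looks at the first byte).

```shell
#!/bin/sh
# check_vit_dir DIR: report whether DIR holds the three files copied or
# downloaded in the previous step, and whether config.json looks like JSON.
check_vit_dir() {
    status=0
    for name in pytorch_model.bin llava.projector config.json; do
        if [ ! -s "$1/$name" ]; then
            echo "missing or empty: $1/$name"
            status=1
        fi
    done
    # Crude check: a JSON object starts with "{", an HTML error page with "<".
    if [ -s "$1/config.json" ] && [ "$(head -c 1 "$1/config.json")" != "{" ]; then
        echo "does not look like JSON: $1/config.json"
        status=1
    fi
    return "$status"
}

# Run from inside the model directory, where vit/ was just created.
if check_vit_dir vit; then
    echo "vit directory looks complete"
fi
```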

Create the Vision Module

O6 / O6N

python3 ../tools/mtmd/legacy-models/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision

Convert the Text Module

O6 / O6N

python3 ../examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown

Model Quantization

Q5_K_M quantization is used here.

O6 / O6N

cd ..
./build/bin/llama-quantize ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-F32.gguf ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf Q5_K_M
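
To get a feel for what Q5_K_M means in storage terms, the average bits per weight can be recomputed from figures in the test log further down this page: a model buffer of 4560.87 MiB and 6.74 B parameters. Using the buffer size as a stand-in for the file size is an assumption, and the function name is hypothetical.

```shell
#!/bin/sh
# bits_per_weight SIZE_MIB PARAMS_BILLIONS
# Average number of bits stored per parameter: total bits / parameter count.
bits_per_weight() {
    awk -v mib="$1" -v params="$2" \
        'BEGIN { printf "%.2f\n", mib * 1048576 * 8 / (params * 1e9) }'
}

bits_per_weight 4560.87 6.74    # 5.68
```

The result agrees with the 5.68 BPW that llama.cpp prints for the quantized file. It is above 5 bits partly because, as the log shows, some tensors are kept in q6_K and f32.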

Model Testing

Test input

O6 / O6N

./build/bin/llama-mtmd-cli -m ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf --mmproj ./llava-v1.6-vicuna-7b/vit/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg --chat-template vicuna

Model output:

$ ./build/bin/llama-mtmd-cli -m ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf --mmproj ./llava-v1.6-vicuna-7b/vit/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg --chat-template vicuna
build: 7110 (3ae282a06) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llava v1.6 Vicuna 7b
llama_model_loader: - kv   2:                           general.basename str              = llava-v1.6-vicuna
llama_model_loader: - kv   3:                         general.size_label str              = 7.1B
llama_model_loader: - kv   4:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   5:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   6:                       llama.context_length u32              = 4096
llama_model_loader: - kv   7:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   8:                          llama.block_count u32              = 32
llama_model_loader: - kv   9:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv  10:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  13:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 17
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 4.45 GiB (5.68 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 4096
print_info: n_embd_inp       = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 32
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 4096
print_info: n_embd_v_gqa     = 4096
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 6.74 B
print_info: general.name     = Llava v1.6 Vicuna 7b
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  4560.87 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache:        CPU KV buffer size =  2048.00 MiB
llama_kv_cache: size = 2048.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =    92.51 MiB
llama_context: graph nodes  = 999
llama_context: graph splits = 1
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
mtmd_cli_context: chat template example:
You are a helpful assistant

USER: Hello
ASSISTANT: Hi there</s>
USER: How are you?
ASSISTANT:
clip_model_loader: model name:   vit-large336-custom
clip_model_loader: description:  image encoder for LLaVA
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    378
clip_model_loader: n_kv:         26

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector:          mlp
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            23
load_hparams: ffn_op:             gelu_quick
load_hparams: projection_dim:     768

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     1
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         595.50 MiB
load_hparams: metadata size:      0.13 MiB
load_tensors: ffn up/down are swapped
alloc_compute_meta: warmup with image size = 336 x 336
alloc_compute_meta:        CPU compute buffer size =    21.55 MiB
alloc_compute_meta: graph splits = 1, nodes = 736
warmup: flash attention is enabled
main: loading model: ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf
encoding image slice...
image slice encoded in 9964 ms
decoding image batch 1/2, n_tokens_batch = 2048
image decoded (batch 1/2) in 177913 ms
decoding image batch 2/2, n_tokens_batch = 832
image decoded (batch 2/2) in 92931 ms

 The image you've provided appears to be a page from The New York Times, dated July 20, 1969. The headline reads "Men Walk on Moon; Astronauts Land on Plain; Collect Rock!" This was a significant event in human history, as it marked the first time humans had set foot on the moon. The article discusses the historic event and the challenges faced by the astronauts during the moon landing. The date of the article is also notable, as it was published just a few days after the Apollo 11 mission, which was the first time humans had landed on the moon.


llama_perf_context_print:        load time =    1022.67 ms
llama_perf_context_print: prompt eval time =  282489.36 ms /  2896 tokens (   97.54 ms per token,    10.25 tokens per second)
llama_perf_context_print:        eval time =   29479.48 ms /   135 runs   (  218.37 ms per token,     4.58 tokens per second)
llama_perf_context_print:       total time =  312180.20 ms /  3031 tokens
llama_perf_context_print:    graphs reused =        134
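
Two of the figures in this log can be reproduced with simple arithmetic, which helps when estimating memory use and speed for other settings. The sketch below recomputes the KV cache size from the context length, layer count, and K/V width (f16 means 2 bytes per value), and the throughput from the token counts and timings. The function names are hypothetical; all input numbers are taken from the log above.

```shell
#!/bin/sh
# kv_cache_mib CELLS LAYERS N_EMBD_KV BYTES_PER_VALUE
# Size in MiB of the K and V caches together (hence the factor 2).
kv_cache_mib() {
    echo $(( 2 * $1 * $2 * $3 * $4 / 1048576 ))
}

# tokens_per_second TOKENS MILLISECONDS
tokens_per_second() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n * 1000 / ms }'
}

kv_cache_mib 4096 32 4096 2          # 2048
tokens_per_second 2896 282489.36     # 10.25 (prompt eval, mostly image tokens)
tokens_per_second 135 29479.48       # 4.58  (text generation)
```

Nearly all of the 312 s total is spent evaluating the image tokens in the prompt, so a smaller context or fewer image tokens would shorten the wait more than faster text generation would.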
