LLaVA 1.6

LLaVA (Large Language and Vision Assistant) is a multimodal vision-language model (VLM) family developed by researchers from institutions such as the University of Wisconsin–Madison, Microsoft Research, and Columbia University. It connects a pre-trained vision encoder with a large language model (LLM) end-to-end, enabling joint image-and-text understanding for tasks such as image captioning, visual question answering, and multimodal chat.

Key features: LLaVA 1.6 uses dynamic-resolution techniques to adapt the input resolution to the image content, which improves recognition of small objects, complex tables, and dense OCR text. The core idea is to map visual features into the language model's embedding space through a projection layer, so the language model can reason about visual information directly.

Model variant: LLaVA 1.6 Vicuna 7B is built on the Vicuna 7B language backbone (about 7B parameters). Compared with v1.5, it uses more training data and improved visual representations, and it supports higher-resolution inputs while maintaining good inference speed.
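
As a rough illustration of what dynamic resolution costs in tokens, the figures printed in the Model Test log at the end of this page can be worked through by hand. The sketch below assumes each 336x336 tile is cut into 14x14 patches (the image_size and patch_size the log reports) and that every tile contributes one token per patch. Reading the result as one overview plus four tiles is an inference from the numbers, not something the log states.

```shell
image_size=336                   # load_hparams: image_size
patch_size=14                    # load_hparams: patch_size
image_tokens=$(( 2048 + 832 ))   # the two "decoding image batch" lines in the log

patches_per_side=$(( image_size / patch_size ))
tokens_per_tile=$(( patches_per_side * patches_per_side ))
tiles=$(( image_tokens / tokens_per_tile ))
leftover=$(( image_tokens % tokens_per_tile ))

echo "tokens per tile: $tokens_per_tile"
echo "tiles encoded:   $tiles (remainder $leftover)"
```

With the logged values this gives 576 tokens per tile and exactly 5 tiles, which is why a single image accounts for most of the 4096-token context.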

Environment Setup

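Follow the llama.cpp guide to build llama.cpp first. The commands below are run from the llama.cpp directory and expect the binaries under ./build/bin.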

Quick Start

Download the Model

O6 / O6N

pip3 install modelscope
cd llama.cpp
modelscope download --model radxa/llava-v1.6-vicuna-7b-gguf llava-v1.6-vicuna-7B-Q5_K_M.gguf --local_dir ./
modelscope download --model radxa/llava-v1.6-vicuna-7b-gguf mmproj-model-f16.gguf --local_dir ./
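
An interrupted download can leave a truncated or placeholder file behind, so it may be worth a quick check before running the model. GGUF files begin with the four ASCII bytes GGUF. The helper below (is_gguf is a name made up for this sketch) only tests that prefix and does not verify the full file.

```shell
# Return success if the file exists and begins with the 4-byte GGUF magic.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

for f in llava-v1.6-vicuna-7B-Q5_K_M.gguf mmproj-model-f16.gguf; do
    if is_gguf "$f"; then
        echo "ok: $f"
    else
        echo "missing or not a GGUF file: $f"
    fi
done
```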

Run the Model

O6 / O6N

./build/bin/llama-mtmd-cli -m ./llava-v1.6-vicuna-7B-Q5_K_M.gguf --mmproj ./mmproj-model-f16.gguf -p "Describe this image." --image ./tools/mtmd/test-1.jpeg
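
To describe several images in one go, the same command can be wrapped in a loop. This is only a sketch: describe_images and the LLAMA_CLI variable are hypothetical names, the flags are exactly those of the command above, and each image is processed in a separate run, so the model is reloaded every time.

```shell
# Path to the CLI built earlier; set LLAMA_CLI to point elsewhere if needed.
LLAMA_CLI="${LLAMA_CLI:-./build/bin/llama-mtmd-cli}"

# describe_images DIR: run the Quick Start command once per .jpeg/.jpg in DIR,
# saving everything the CLI prints to <image>.txt next to the image.
describe_images() {
    for img in "$1"/*.jpeg "$1"/*.jpg; do
        [ -f "$img" ] || continue   # skip glob patterns that matched nothing
        "$LLAMA_CLI" -m ./llava-v1.6-vicuna-7B-Q5_K_M.gguf \
            --mmproj ./mmproj-model-f16.gguf \
            -p "Describe this image." --image "$img" > "$img.txt" 2>&1
    done
}
```

For example, describe_images ./tools/mtmd would write test-1.jpeg.txt beside the bundled test image.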

Full Conversion Workflow

Clone the Model Repository

O6 / O6N

cd llama.cpp
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b

Create a Virtual Environment

O6 / O6N

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
pip3 install -r tools/mtmd/requirements.txt

Model Conversion

Split the model

O6 / O6N

python3 ./tools/mtmd/legacy-models/llava_surgery_v2.py -C -m ./llava-v1.6-vicuna-7b/

After completion, you should find llava.projector and llava.clip in the model directory.
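
That check can be scripted so a failed split is caught before the later steps copy the files. The sketch below uses a hypothetical helper, require_files, and assumes the model was cloned into ./llava-v1.6-vicuna-7b as shown earlier.

```shell
# require_files DIR NAME...: report each NAME that is missing or empty in DIR.
# Returns non-zero if any file is absent.
require_files() {
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    return "$status"
}

if require_files ./llava-v1.6-vicuna-7b llava.projector llava.clip; then
    echo "split outputs present"
fi
```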

Create the `vit` directory

O6 / O6N

cd ./llava-v1.6-vicuna-7b/
mkdir vit
cp ./llava.clip vit/pytorch_model.bin
cp ./llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
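
Because curl runs silently here and is not told to fail on HTTP errors, a broken URL can leave an HTML error page in vit/config.json without any message. The sketch below is a cheap sanity check, not a JSON parser: looks_like_json is a hypothetical helper that only confirms the file is non-empty and starts with an opening brace.

```shell
# looks_like_json FILE: succeed if FILE is non-empty and its first
# non-whitespace character is "{" (an HTML error page would start with "<").
looks_like_json() {
    [ -s "$1" ] && [ "$(tr -d '[:space:]' < "$1" | head -c 1)" = "{" ]
}

if looks_like_json vit/config.json; then
    echo "vit/config.json looks like JSON"
else
    echo "vit/config.json is missing or not JSON; re-run the download" >&2
fi
```

Adding curl's -f option to the download command is another way to surface HTTP errors, since it makes curl exit with a non-zero status instead of saving the error page.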

Create the vision module

O6 / O6N

python3 ../tools/mtmd/legacy-models/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision

Convert the Text Module

O6 / O6N

python3 ../examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown

Model Quantization

This guide uses Q5_K_M quantization.

O6 / O6N

cd ..
./build/bin/llama-quantize ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-F32.gguf ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf Q5_K_M
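
To plan disk space, file sizes can be estimated as parameters times bits per weight divided by 8. The sketch below uses the 6.74 B parameter count and the 5.68 BPW that the loader prints in the Model Test log; est_gib is a hypothetical helper, and the estimate ignores metadata and rounding in the reported figures.

```shell
# est_gib PARAMS_BILLIONS BITS_PER_WEIGHT: approximate file size in GiB.
est_gib() {
    awk -v p="$1" -v b="$2" \
        'BEGIN { printf "%.2f\n", p * 1000000000 * b / 8 / (1024 * 1024 * 1024) }'
}

est_gib 6.74 32      # F32 intermediate file: about 25.11 GiB
est_gib 6.74 5.68    # Q5_K_M at the reported 5.68 BPW: about 4.46 GiB
```

The second figure is close to the 4.45 GiB file size in the log, and the first suggests keeping roughly 25 GiB free for the F32 file until quantization has finished.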

Model Test

Test input image

O6 / O6N

./build/bin/llama-mtmd-cli -m ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf --mmproj ./llava-v1.6-vicuna-7b/vit/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg --chat-template vicuna

Model output:

$ ./build/bin/llama-mtmd-cli -m ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf --mmproj ./llava-v1.6-vicuna-7b/vit/mmproj-model-f16.gguf -p "What is this picture about?" --image ./tools/mtmd/test-1.jpeg --chat-template vicuna
build: 7110 (3ae282a06) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llava v1.6 Vicuna 7b
llama_model_loader: - kv   2:                           general.basename str              = llava-v1.6-vicuna
llama_model_loader: - kv   3:                         general.size_label str              = 7.1B
llama_model_loader: - kv   4:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv   5:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   6:                       llama.context_length u32              = 4096
llama_model_loader: - kv   7:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   8:                          llama.block_count u32              = 32
llama_model_loader: - kv   9:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv  10:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv  13:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 17
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 4.45 GiB (5.68 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 4096
print_info: n_embd_inp       = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 32
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 4096
print_info: n_embd_v_gqa     = 4096
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 6.74 B
print_info: general.name     = Llava v1.6 Vicuna 7b
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  4560.87 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache:        CPU KV buffer size =  2048.00 MiB
llama_kv_cache: size = 2048.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =    92.51 MiB
llama_context: graph nodes  = 999
llama_context: graph splits = 1
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
mtmd_cli_context: chat template example:
You are a helpful assistant

USER: Hello
ASSISTANT: Hi there</s>
USER: How are you?
ASSISTANT:
clip_model_loader: model name:   vit-large336-custom
clip_model_loader: description:  image encoder for LLaVA
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    378
clip_model_loader: n_kv:         26

clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector:          mlp
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            23
load_hparams: ffn_op:             gelu_quick
load_hparams: projection_dim:     768

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     1
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         595.50 MiB
load_hparams: metadata size:      0.13 MiB
load_tensors: ffn up/down are swapped
alloc_compute_meta: warmup with image size = 336 x 336
alloc_compute_meta:        CPU compute buffer size =    21.55 MiB
alloc_compute_meta: graph splits = 1, nodes = 736
warmup: flash attention is enabled
main: loading model: ./llava-v1.6-vicuna-7b/llava-v1.6-vicuna-7B-Q5_K_M.gguf
encoding image slice...
image slice encoded in 9964 ms
decoding image batch 1/2, n_tokens_batch = 2048
image decoded (batch 1/2) in 177913 ms
decoding image batch 2/2, n_tokens_batch = 832
image decoded (batch 2/2) in 92931 ms

 The image you've provided appears to be a page from The New York Times, dated July 20, 1969. The headline reads "Men Walk on Moon; Astronauts Land on Plain; Collect Rock!" This was a significant event in human history, as it marked the first time humans had set foot on the moon. The article discusses the historic event and the challenges faced by the astronauts during the moon landing. The date of the article is also notable, as it was published just a few days after the Apollo 11 mission, which was the first time humans had landed on the moon.


llama_perf_context_print:        load time =    1022.67 ms
llama_perf_context_print: prompt eval time =  282489.36 ms /  2896 tokens (   97.54 ms per token,    10.25 tokens per second)
llama_perf_context_print:        eval time =   29479.48 ms /   135 runs   (  218.37 ms per token,     4.58 tokens per second)
llama_perf_context_print:       total time =  312180.20 ms /  3031 tokens
llama_perf_context_print:    graphs reused =        134
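
A few of the figures in this log can be cross-checked with simple arithmetic, which helps when comparing runs on other hardware. The sketch below assumes f16 values take 2 bytes and uses the cell count, layer count, and n_embd_k_gqa width printed above; kv_mib and tok_per_sec are hypothetical helper names.

```shell
# kv_mib CELLS LAYERS WIDTH BYTES_PER_VALUE: size of the K (or V) cache in MiB.
kv_mib() {
    echo $(( $1 * $2 * $3 * $4 / 1024 / 1024 ))
}

# tok_per_sec TOKENS MILLISECONDS: throughput with two decimals.
tok_per_sec() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

kv_mib 4096 32 4096 2              # K cache in f16: 1024 (V is the same, 2048 total)
tok_per_sec 2896 282489.36         # prompt eval: 10.25 tokens per second
tok_per_sec 135 29479.48           # generation: 4.58 tokens per second
echo $(( 2896 - (2048 + 832) ))    # prompt tokens not in the image batches: 16
```

The results match the KV buffer sizes and the tokens-per-second figures in the log. The 16 tokens left over are presumably the text prompt and chat template, although the log does not break this down. Nearly all of the 312-second run is spent evaluating the image tokens.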
