llama.cpp/docs/multimodal.md

# Multimodal

llama.cpp supports multimodal input via `libmtmd`. Currently, there are 2 tools support this feature:
- [llama-mtmd-cli](../tools/mtmd/README.md)
- [llama-server](../tools/server/README.md) via OpenAI-compatible `/chat/completions` API

To enable it, can use use one of the 2 methods below:

- Use `-hf` option with a supported model (see a list of pre-quantized model below)
    - To load a model using `-hf` while disabling multimodal, use `--no-mmproj`
    - To load a model using `-hf` while using a custom mmproj file, use `--mmproj local_file.gguf`
- Use `-m model.gguf` option with `--mmproj file.gguf` to specify text and multimodal projector respectively

By default, multimodal projector will be offloaded to GPU. To disable this, add `--no-mmproj-offload`

For example:

```sh
# simple usage with CLI
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF

# simple usage with server
llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# using local file
llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf

# no GPU offload
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
```

## Pre-quantized models

These are ready-to-use models, most of them come with `Q4_K_M` quantization by default. They can be found at the Hugging Face page of the ggml-org: https://huggingface.co/ggml-org

Replaces the `(tool_name)` with the name of binary you want to use. For example, `llama-mtmd-cli` or `llama-server`

NOTE: some models may require large context window, for example: `-c 8192`

```sh
# Gemma 3
(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF
(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF
(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF

# SmolVLM
(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

# Pixtral 12B
(tool_name) -hf ggml-org/pixtral-12b-GGUF

# Qwen 2 VL
(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF

# Qwen 2.5 VL
(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF

# Mistral Small 3.1 24B (IQ2_M quantization)
(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF

# InternVL 2.5 and 3
(tool_name) -hf ggml-org/InternVL2_5-1B-GGUF
(tool_name) -hf ggml-org/InternVL2_5-4B-GGUF
(tool_name) -hf ggml-org/InternVL3-1B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-2B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-8B-Instruct-GGUF
(tool_name) -hf ggml-org/InternVL3-14B-Instruct-GGUF
```
server : vision support via libmtmd (#12898) * server : (experimental) vision support via libmtmd * mtmd : add more api around mtmd_image_tokens * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * abstract out the batch management * small fix * refactor logic adding tokens to batch * implement hashing image * use FNV hash, now hash bitmap instead of file data * allow decoding image embedding to be split into batches * rm whitespace * disable some features when mtmd is on * fix --no-mmproj-offload * mtmd_context_params no timings * refactor server_inp to server_tokens * fix the failing test case * init * wip * working version * add mtmd::bitmaps * add test target * rm redundant define * test: mtmd_input_chunks_free * rm outdated comment * fix merging issue * explicitly create mtmd::input_chunks * mtmd_input_chunk_copy * add clone() * improve server_input struct * clip : fix confused naming ffn_up and ffn_down * rm ffn_i/o/g naming * rename n_embd, n_ff * small fix * no check n_ff * fix detokenize * add const to various places * add warning about breaking changes * add c api * helper: use mtmd_image_tokens_get_n_pos * fix ctx_shift * fix name shadowing * more strict condition * support remote image_url * remote image_url log * add CI test * do not log base64 * add "has_multimodal" to /props * remove dangling image * speculative: use slot.cache_tokens.insert * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * rm can_be_detokenized * on prmpt processing done, assert cache_tokens.size * handle_completions_impl returns void * adapt the new web ui * update docs and hot topics * rm assert * small fix (2) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2025-05-09 19:29:37 +02:00			`# Multimodal`

			llama.cpp supports multimodal input via `libmtmd`. Currently, there are 2 tools support this feature:
			`- [llama-mtmd-cli](../tools/mtmd/README.md)`
			- [llama-server](../tools/server/README.md) via OpenAI-compatible `/chat/completions` API

			`To enable it, can use use one of the 2 methods below:`

server : update docs (#13432) 2025-05-10 18:44:49 +02:00			- Use `-hf` option with a supported model (see a list of pre-quantized model below)
server : vision support via libmtmd (#12898) * server : (experimental) vision support via libmtmd * mtmd : add more api around mtmd_image_tokens * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * abstract out the batch management * small fix * refactor logic adding tokens to batch * implement hashing image * use FNV hash, now hash bitmap instead of file data * allow decoding image embedding to be split into batches * rm whitespace * disable some features when mtmd is on * fix --no-mmproj-offload * mtmd_context_params no timings * refactor server_inp to server_tokens * fix the failing test case * init * wip * working version * add mtmd::bitmaps * add test target * rm redundant define * test: mtmd_input_chunks_free * rm outdated comment * fix merging issue * explicitly create mtmd::input_chunks * mtmd_input_chunk_copy * add clone() * improve server_input struct * clip : fix confused naming ffn_up and ffn_down * rm ffn_i/o/g naming * rename n_embd, n_ff * small fix * no check n_ff * fix detokenize * add const to various places * add warning about breaking changes * add c api * helper: use mtmd_image_tokens_get_n_pos * fix ctx_shift * fix name shadowing * more strict condition * support remote image_url * remote image_url log * add CI test * do not log base64 * add "has_multimodal" to /props * remove dangling image * speculative: use slot.cache_tokens.insert * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * rm can_be_detokenized * on prmpt processing done, assert cache_tokens.size * handle_completions_impl returns void * adapt the new web ui * update docs and hot topics * rm assert * small fix (2) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2025-05-09 19:29:37 +02:00			- To load a model using `-hf` while disabling multimodal, use `--no-mmproj`
			- To load a model using `-hf` while using a custom mmproj file, use `--mmproj local_file.gguf`
			- Use `-m model.gguf` option with `--mmproj file.gguf` to specify text and multimodal projector respectively

			By default, multimodal projector will be offloaded to GPU. To disable this, add `--no-mmproj-offload`

			`For example:`

			```sh
			`# simple usage with CLI`
			`llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF`

			`# simple usage with server`
			`llama-server -hf ggml-org/gemma-3-4b-it-GGUF`

			`# using local file`
			`llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf`

			`# no GPU offload`
			`llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload`
			```

			`## Pre-quantized models`

docs: Update link to ggml-org in multimodal.md (#13513) * Update multimodal.md Minor change to include the huggingface link * Update docs/multimodal.md --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> 2025-05-14 09:59:12 +02:00			These are ready-to-use models, most of them come with `Q4_K_M` quantization by default. They can be found at the Hugging Face page of the ggml-org: https://huggingface.co/ggml-org
server : vision support via libmtmd (#12898) * server : (experimental) vision support via libmtmd * mtmd : add more api around mtmd_image_tokens * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * abstract out the batch management * small fix * refactor logic adding tokens to batch * implement hashing image * use FNV hash, now hash bitmap instead of file data * allow decoding image embedding to be split into batches * rm whitespace * disable some features when mtmd is on * fix --no-mmproj-offload * mtmd_context_params no timings * refactor server_inp to server_tokens * fix the failing test case * init * wip * working version * add mtmd::bitmaps * add test target * rm redundant define * test: mtmd_input_chunks_free * rm outdated comment * fix merging issue * explicitly create mtmd::input_chunks * mtmd_input_chunk_copy * add clone() * improve server_input struct * clip : fix confused naming ffn_up and ffn_down * rm ffn_i/o/g naming * rename n_embd, n_ff * small fix * no check n_ff * fix detokenize * add const to various places * add warning about breaking changes * add c api * helper: use mtmd_image_tokens_get_n_pos * fix ctx_shift * fix name shadowing * more strict condition * support remote image_url * remote image_url log * add CI test * do not log base64 * add "has_multimodal" to /props * remove dangling image * speculative: use slot.cache_tokens.insert * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * rm can_be_detokenized * on prmpt processing done, assert cache_tokens.size * handle_completions_impl returns void * adapt the new web ui * update docs and hot topics * rm assert * small fix (2) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2025-05-09 19:29:37 +02:00
			Replaces the `(tool_name)` with the name of binary you want to use. For example, `llama-mtmd-cli` or `llama-server`

			NOTE: some models may require large context window, for example: `-c 8192`

			```sh
			`# Gemma 3`
			`(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF`
			`(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF`
			`(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF`

			`# SmolVLM`
			`(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF`
			`(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF`
			`(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF`
			`(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF`
			`(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF`
			`(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF`

			`# Pixtral 12B`
			`(tool_name) -hf ggml-org/pixtral-12b-GGUF`

			`# Qwen 2 VL`
			`(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF`
			`(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF`

			`# Qwen 2.5 VL`
			`(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF`
			`(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF`
			`(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF`
			`(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF`

			`# Mistral Small 3.1 24B (IQ2_M quantization)`
			`(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF`
mtmd : support InternVL 2.5 and 3 (#13422) * convert : internvl support * InternVL3-1B working * fix regression * rm mobilevlm from test * fix conversion * add test for internvl * add to list of pre-quant * restore boi/eoi check * add clarify comment for norm eps 2025-05-10 16:26:42 +02:00
			`# InternVL 2.5 and 3`
			`(tool_name) -hf ggml-org/InternVL2_5-1B-GGUF`
docs : Fix typo in InternVL3 model name (#13440) 2025-05-10 22:26:46 +02:00			`(tool_name) -hf ggml-org/InternVL2_5-4B-GGUF`
mtmd : support InternVL 2.5 and 3 (#13422) * convert : internvl support * InternVL3-1B working * fix regression * rm mobilevlm from test * fix conversion * add test for internvl * add to list of pre-quant * restore boi/eoi check * add clarify comment for norm eps 2025-05-10 16:26:42 +02:00			`(tool_name) -hf ggml-org/InternVL3-1B-Instruct-GGUF`
			`(tool_name) -hf ggml-org/InternVL3-2B-Instruct-GGUF`
docs : Fix typo in InternVL3 model name (#13440) 2025-05-10 22:26:46 +02:00			`(tool_name) -hf ggml-org/InternVL3-8B-Instruct-GGUF`
mtmd : support InternVL 2.5 and 3 (#13422) * convert : internvl support * InternVL3-1B working * fix regression * rm mobilevlm from test * fix conversion * add test for internvl * add to list of pre-quant * restore boi/eoi check * add clarify comment for norm eps 2025-05-10 16:26:42 +02:00			`(tool_name) -hf ggml-org/InternVL3-14B-Instruct-GGUF`
server : vision support via libmtmd (#12898) * server : (experimental) vision support via libmtmd * mtmd : add more api around mtmd_image_tokens * mtmd : add more api around mtmd_image_tokens * mtmd : ability to calc image hash * shared_ptr for mtmd_image_tokens * move hash to user-define ID (fixed) * abstract out the batch management * small fix * refactor logic adding tokens to batch * implement hashing image * use FNV hash, now hash bitmap instead of file data * allow decoding image embedding to be split into batches * rm whitespace * disable some features when mtmd is on * fix --no-mmproj-offload * mtmd_context_params no timings * refactor server_inp to server_tokens * fix the failing test case * init * wip * working version * add mtmd::bitmaps * add test target * rm redundant define * test: mtmd_input_chunks_free * rm outdated comment * fix merging issue * explicitly create mtmd::input_chunks * mtmd_input_chunk_copy * add clone() * improve server_input struct * clip : fix confused naming ffn_up and ffn_down * rm ffn_i/o/g naming * rename n_embd, n_ff * small fix * no check n_ff * fix detokenize * add const to various places * add warning about breaking changes * add c api * helper: use mtmd_image_tokens_get_n_pos * fix ctx_shift * fix name shadowing * more strict condition * support remote image_url * remote image_url log * add CI test * do not log base64 * add "has_multimodal" to /props * remove dangling image * speculative: use slot.cache_tokens.insert * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * rm can_be_detokenized * on prmpt processing done, assert cache_tokens.size * handle_completions_impl returns void * adapt the new web ui * update docs and hot topics * rm assert * small fix (2) --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> 2025-05-09 19:29:37 +02:00			```