# LLaVA
Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
as well as [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.
The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
models are available.
For llava-1.6, a variety of prepared gguf models are available as well ([7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf)).
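For example, one of the pre-converted llava-1.5 models can be fetched with `huggingface-cli` and used directly. This is only a sketch; the file names below are assumptions, so check the model page for the files that are actually published:
```sh
# hypothetical example: download a pre-converted model and its projector
# (file names are assumptions -- verify them on the Hugging Face model page)
huggingface-cli download mys/ggml_llava-v1.5-7b ggml-model-q4_k.gguf --local-dir ./llava-v1.5-7b
huggingface-cli download mys/ggml_llava-v1.5-7b mmproj-model-f16.gguf --local-dir ./llava-v1.5-7b
```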
After the API is confirmed, more models will be supported / uploaded.
## Usage
Build with cmake or run `make llama-llava-cli` to build it.
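A typical CMake build is sketched below; it builds all targets, including `llama-llava-cli`, and places the binaries under `build/bin`:
```sh
# standard llama.cpp CMake build
cmake -B build
cmake --build build --config Release
```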
After building, run: `./llama-llava-cli` to see the usage. For example:
```sh
./llama-llava-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf --image path/to/an/image.jpg
```
**note**: A lower temperature like 0.1 is recommended for better quality. Add `--temp 0.1` to the command to do so.
**note**: For GPU offloading, make sure to use the `-ngl` flag just like usual.
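Putting the two notes together, an invocation with a lower temperature and GPU offloading might look like the following sketch (the 32 offloaded layers are just an example value):
```sh
# example: lower temperature plus offloading 32 layers to the GPU
./llama-llava-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --image path/to/an/image.jpg \
    --temp 0.1 -ngl 32
```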
## LLaVA 1.5
1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:
```sh
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```
2. Install the required Python packages:
```sh
pip install -r examples/llava/requirements.txt
```
3. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:
```sh
python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
```
4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:
```sh
python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```
5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:
```sh
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```
Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
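You can now point `llama-llava-cli` at the freshly converted files; the file names below assume the converters' default output names, as in the usage example above:
```sh
# run the converted llava-1.5 model (file names assume the converters' defaults)
./llama-llava-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --image path/to/an/image.jpg --temp 0.1
```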
## LLaVA 1.6 gguf conversion
1) First clone a LLaVA 1.6 model:
```console
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
```
2) Install the required Python packages:
```sh
pip install -r examples/llava/requirements.txt
```
3) Use `llava_surgery_v2.py`, which also supports llava-1.5 variants, for both PyTorch and safetensors models:
```console
python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- You will find a `llava.projector` and a `llava.clip` file in your model directory.
4) Copy the `llava.clip` file into a subdirectory (like `vit`), rename it to `pytorch_model.bin`, and add a fitting ViT configuration to the directory:
```console
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```
5) Create the visual gguf model:
```console
python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5; the difference is that we tell the converter (via `--clip-model-is-vision`) that we are working with the pure vision model part of CLIP.
6) Then convert the model to gguf format:
```console
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```
7) And finally we can run llava-cli using the 1.6 model version:
```console
./llama-llava-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf --image some-image.jpg -c 4096
```
**note** llava-1.6 needs more context than llava-1.5; at least 3000 tokens of context are needed (just run it with `-c 4096`).
**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)
**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the LLaVA-NeXT model:
```python
import transformers

model_path = ...        # path to the original LLaVA-NeXT checkpoint
llm_export_path = ...   # directory to save the extracted language model to

# load the full multimodal model, then save only its language-model part
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)
```
Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
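For example, assuming the exported LLM was saved to a hypothetical `../llava-v1.6-llm` directory, the conversion could look like this sketch (see the script's `--help` for the exact options of your llama.cpp version):
```console
python ./convert_hf_to_gguf.py ../llava-v1.6-llm --outfile ../llava-v1.6-llm/ggml-model-f16.gguf --outtype f16
```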
## llava-cli templating and llava-1.6 prompting
llava-1.5 models all use the same vicuna prompt; here you can just add your image question like `-p "Provide a full description."`
For llava-1.5 models which are not vicuna (mistral and Yi), you need to adapt the system prompt as well as the user prompt; for this purpose llava-cli has a basic templating system:
**For Mistral and using llava-cli binary:**
Add this: `-p "<image>\nUSER:\nProvide a full description.\nASSISTANT:\n"`
The mistral template for llava-1.6 seems to use no system prompt and a USER/ASSISTANT role.
**For the 34B this should work:**
Add this: `-e -p <|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nProvide a full description.<|im_end|><|im_start|>assistant\n`
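As a complete sketch, the Mistral template above could be passed to `llama-llava-cli` like this (the model paths are placeholders; `-e` makes the CLI interpret the `\n` escapes):
```console
./llama-llava-cli -m ../llava-v1.6-mistral-7b/ggml-model-f16.gguf \
    --mmproj vit/mmproj-model-f16.gguf \
    --image some-image.jpg -c 4096 --temp 0.1 \
    -e -p "<image>\nUSER:\nProvide a full description.\nASSISTANT:\n"
```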
## How to know if you are running in llava-1.5 or llava-1.6 mode
When running llava-cli you will see visual information printed right before the prompt is processed:
**Llava-1.5:**
`encode_image_with_clip: image embedding created: 576 tokens`
**Llava-1.6 (anything above 576):**
`encode_image_with_clip: image embedding created: 2880 tokens`
Alternatively, just note how many "tokens" have been used for your prompt; llava-1.6 will also show 1000+ tokens.
## TODO
- [x] Support non-CPU backend for the image encoding part.
- [ ] Support different sampling methods.
- [ ] Support more model variants.