Commit Graph

2349 Commits

Author SHA1 Message Date
15499eb942 mpt : do not duplicate token_embd.weight on disk (#5670) b2249 2024-02-22 17:05:23 -05:00
96633eeca1 gemma : use more bits for the token_embd.weight tensor (#5650)
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type
b2248
2024-02-22 23:23:46 +02:00
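A minimal sketch of the idea behind the commit above, assuming a hypothetical `choose_embd_type` helper (this is not the actual llama.cpp quantization code): the tied token-embedding tensor is quantized with the wider output type instead of the default low-bit mix.

```cpp
#include <cstring>
#include "ggml.h"

// Hedged sketch, not the real llama.cpp quantizer: give the (possibly tied)
// token embedding tensor the same, wider quantization type as the output
// tensor so the shared weights keep enough precision.
static ggml_type choose_embd_type(const char * name, ggml_type default_type, ggml_type output_type) {
    if (std::strcmp(name, "token_embd.weight") == 0) {
        return output_type; // e.g. Q8_0 / Q6_K instead of the default 4-bit mix
    }
    return default_type;
}
```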
847eedbdb2 py : add Gemma conversion from HF models (#5647)
* py : add gemma conversion from HF models

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update convert-hf-to-gguf.py

Co-authored-by: Aarni Koskela <akx@iki.fi>

* Update convert-hf-to-gguf.py

Co-authored-by: Jared Van Bortel <jared@nomic.ai>

---------

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
b2247
2024-02-22 23:22:48 +02:00
7e4f339c40 ggml : always define ggml_fp16_t as uint16_t (#5666)
* ggml : always define ggml_fp16_t as uint16_t

ggml-ci

* ggml : cont

ggml-ci

* ggml : cont

* ggml : cont

ggml-ci

* ggml : cont

ggml-ci

* cuda : no longer include ggml headers last

ggml-ci

* ggml : fix q6_K FP16 -> FP32 conversion

ggml-ci

* ggml : more FP16 -> FP32 conversion fixes

ggml-ci
b2246
2024-02-22 23:21:39 +02:00
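With `ggml_fp16_t` defined as a plain `uint16_t` bit pattern, every read goes through an explicit conversion. A hedged scalar sketch of such an FP16 -> FP32 conversion follows (not ggml's exact implementation, which also has table-based and intrinsic paths):

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t ggml_fp16_t; // storage-only half type, as in this commit

// Hedged scalar FP16 -> FP32 conversion; handles zero, subnormal, normal,
// inf and NaN inputs by rebuilding the IEEE-754 single-precision bits.
static float fp16_to_fp32(ggml_fp16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  =  h        & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                 // inf / NaN
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp == 0) {
        if (man == 0) {
            bits = sign;                // +/- zero
        } else {                        // subnormal: renormalize the mantissa
            exp = 127 - 15 + 1;
            while ((man & 0x400u) == 0) { man <<= 1; exp--; }
            man &= 0x3FFu;
            bits = sign | (exp << 23) | (man << 13);
        }
    } else {                            // normal: rebias the exponent
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    }

    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```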
334f76fa38 sync : ggml b2245 2024-02-22 23:21:05 +02:00
efd56b1c21 ggml : 32-bit arm compat (whisper/1891)
* ggml : 32-bit arm compat

* ggml : add ggml_vqtbl1q_s8 impl

* ggml : cont
2024-02-22 23:20:50 +02:00
201294ae17 nix: init singularity and docker images (#5056)
Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.

Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
b2243
2024-02-22 11:44:10 -08:00
5a9e2f60ba py : minor fixes (#5668) 2024-02-22 20:13:25 +02:00
373ee3fbba Add Gemma chat template (#5665)
* add gemma chat template

* gemma: only apply system_prompt on non-model message
b2241
2024-02-22 19:10:21 +01:00
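For reference, the turn layout that a Gemma chat template produces looks roughly like the following (a hedged illustration; exact whitespace and system-prompt handling follow the implementation in the PR):

```cpp
// Hedged illustration of a Gemma-formatted conversation; Gemma has no system
// role, so a system prompt is typically folded into the first user turn.
static const char * gemma_chat_example =
    "<start_of_turn>user\n"
    "You are a helpful assistant.\n"
    "Why is the sky blue?<end_of_turn>\n"
    "<start_of_turn>model\n";
```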
4cb4d8b22d workflows: nix: hardcode cachix ids, build unconditionally (#5663)
Since GitHub does not expose environment and repository variables to PRs coming from forks, we've effectively been disabling the Nix CI actions for most PRs.

The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
b2240
2024-02-22 08:32:09 -08:00
3a03541ced minor : fix trailing whitespace (#5638) b2239 2024-02-22 13:54:03 +02:00
56d03d92be readme : update hot topics 2024-02-22 10:35:54 +02:00
a46f50747b server : fallback to chatml, add AlphaMonarch chat template (#5628)
* server: fallback to chatml

* add new chat template

* server: add AlphaMonarch to test chat template

* server: only check model template if there is no custom tmpl

* remove TODO
b2237
2024-02-22 10:33:24 +02:00
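For comparison with the Gemma format shown earlier, the ChatML layout that the server falls back to looks roughly like this (hedged illustration, not copied from the server tests):

```cpp
// Hedged illustration of the ChatML format used as the fallback template.
static const char * chatml_example =
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Hello!<|im_end|>\n"
    "<|im_start|>assistant\n";
```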
c5688c6250 server : clarify some params in the docs (#5640) 2024-02-22 10:27:32 +02:00
4ef245a92a mpt : add optional bias tensors (#5638)
Update MPT support with optional bias parameters, to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
b2235
2024-02-22 10:15:13 +02:00
973053d8b0 llama : fix loading models with shared tok_embd and output (#5651)
ggml-ci
b2234
2024-02-22 00:42:09 +01:00
7c8bcc11dc Add docs for llama_chat_apply_template (#5645)
* add docs for llama_chat_apply_template

* fix typo
b2233
2024-02-22 00:31:00 +01:00
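A short, hedged usage sketch of the documented API (buffer size and error handling simplified; as documented, passing NULL for `tmpl` selects the chat template stored in the model's metadata):

```cpp
#include <cstdio>
#include "llama.h"

// Hedged example: format a two-message chat with the model's built-in
// template and append the assistant prefix (add_ass = true).
static void print_formatted_chat(const struct llama_model * model) {
    llama_chat_message msgs[] = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };
    char buf[2048];
    const int32_t n = llama_chat_apply_template(model, /*tmpl=*/NULL, msgs, 2,
                                                /*add_ass=*/true, buf, (int32_t) sizeof(buf));
    if (n > 0 && n < (int32_t) sizeof(buf)) {
        std::printf("%.*s\n", n, buf);
    } // a larger return value means the buffer was too small
}
```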
7fe4678b02 llama : fix session save/load with quantized KV (#5649) b2232 2024-02-21 22:52:39 +01:00
ba2135ccae gemma : allow offloading the output tensor (#5646) b2231 2024-02-21 22:18:23 +01:00
89febfed93 examples : do not assume BOS when shifting context (#5622) b2230 2024-02-21 10:33:54 -05:00
5022cf242d sync : ggml 2024-02-21 16:52:52 +02:00
1ecea255eb server: health: fix race condition on slots data using tasks queue (#5634)
* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.
b2228
2024-02-21 15:47:48 +01:00
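A hedged sketch of the pattern this fix describes (not the server's actual code): instead of reading slot state directly from the HTTP handler thread, the handler posts a task to the main loop's queue and waits for a snapshot, so the slot data is only ever touched by one thread.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical task queue: tasks are executed one by one on the main loop
// thread, which is the only thread allowed to touch the slots vector.
struct task_queue {
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;

    void post(std::function<void()> fn) {
        { std::lock_guard<std::mutex> lock(mtx); tasks.push_back(std::move(fn)); }
        cv.notify_one();
    }
};

struct slot_state { bool idle; };

// HTTP handler side: request a snapshot of the slot data and block until the
// main loop has produced it (simplified; the real server uses result queues).
static std::vector<slot_state> get_slots_snapshot(task_queue & q, std::vector<slot_state> & slots) {
    std::mutex m;
    std::condition_variable done_cv;
    bool done = false;
    std::vector<slot_state> snapshot;

    q.post([&] {
        snapshot = slots;                       // safe: runs on the main loop thread
        std::lock_guard<std::mutex> lock(m);
        done = true;
        done_cv.notify_one();
    });

    std::unique_lock<std::mutex> lock(m);
    done_cv.wait(lock, [&] { return done; });
    return snapshot;
}
```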
a00a35cef9 readme : add LocalAI to the available UIs (#5629) 2024-02-21 16:39:10 +02:00
eccd7a26dd sync : ggml (#5633)
* ggml : fix conv_2d batch mode (ggml/737)

Co-authored-by: bssrdf <bssrdf@gmail.com>

* ggml : compute forward no longer pass src tensors (ggml/729)

* sync : ggml

ggml-ci

---------

Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
b2226
2024-02-21 16:17:10 +02:00
c14f72db9c readme : update hot topics 2024-02-21 15:39:54 +02:00
cc6cac08e3 llava : add --skip-unknown to 1.6 convert.py (#5632)
This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints, to avoid modifying
files that may be checked out.

The motivation for this change is that this was done for 1.5
in Commit fc0c8d286a ("llava :
update surgery script to not remove tensors") and makes the examples
more consistent.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-21 15:36:57 +02:00
580111d42b llama : add gemma model (#5631)
There are a couple of notable things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
b2223
2024-02-21 15:08:22 +02:00
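A hedged sketch of point 1 above (weight tying), not the actual llama.cpp loader: when an architecture shares its input and output embeddings, the same tensor is used both for the token-embedding lookup and the final projection to logits.

```cpp
#include "ggml.h"

// Hypothetical model struct for illustration only.
struct toy_model {
    struct ggml_tensor * tok_embd; // [n_embd, n_vocab] input embedding
    struct ggml_tensor * output;   // output projection; may alias tok_embd
};

// If the checkpoint ships no separate output matrix (as with Gemma-style
// weight tying), reuse the token embedding tensor for the logits projection.
static void tie_output_weights(toy_model & m) {
    if (m.output == nullptr) {
        m.output = m.tok_embd;
    }
}
```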
88c46cbdac [SYCL] context add name (#5624)
* [SYCL] context add name

* name should start with SYCL*
b2222
2024-02-21 17:52:06 +08:00
a14679cc30 IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq thing missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b2221
2024-02-21 11:39:52 +02:00
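A hedged sketch of the IQ4_NL idea described above (the codebook values below are illustrative placeholders, not the real kvalues_iq4nl, and the block layout is simplified): each block of 32 weights stores one scale plus 32 four-bit indices into a fixed non-linear codebook.

```cpp
#include <cmath>
#include <cstdint>

#define QK4_NL 32  // block size, per the "Switching to blocks of 32" step above

// Illustrative non-linear codebook (NOT the real kvalues_iq4nl values).
static const int8_t kvalues_demo[16] = {
    -127, -100, -78, -60, -44, -31, -19, -8, 2, 13, 25, 39, 55, 73, 93, 115
};

struct block_demo {
    float   d;               // per-block scale (fp16 in the real format)
    uint8_t qs[QK4_NL / 2];  // two 4-bit codebook indices per byte
};

// Quantize one block: scale the weights into the codebook's range, then pick
// the nearest codebook entry for each weight and pack two indices per byte.
static void quantize_block_demo(const float * x, block_demo * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK4_NL; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    y->d = amax / 127.0f;
    const float id = y->d > 0.0f ? 1.0f / y->d : 0.0f;

    for (int i = 0; i < QK4_NL; i += 2) {
        uint8_t idx[2] = { 0, 0 };
        for (int j = 0; j < 2; ++j) {
            float best = 1e30f;
            for (uint8_t k = 0; k < 16; ++k) {
                const float diff = std::fabs(x[i + j] * id - kvalues_demo[k]);
                if (diff < best) { best = diff; idx[j] = k; }
            }
        }
        y->qs[i / 2] = idx[0] | (idx[1] << 4);
    }
}
```

Dequantization is then roughly `d * kvalues[index]`, which is why a non-linear codebook can track the weight distribution better than uniform 4-bit levels at the same storage cost.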
6560bed3f0 server : support llava 1.6 (#5553)
* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build
b2220
2024-02-20 21:07:22 +02:00
06bf2cf8c4 make : fix debug build with CUDA (#5616) b2219 2024-02-20 20:06:17 +01:00
4ed8e4fbef llava : add explicit instructions for llava-1.6 (#5611)
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.

The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-20 19:30:27 +02:00
9c405c9f9a Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2217
2024-02-20 15:58:27 +01:00
5207b3fbc5 readme : update UI list (#5605)
* Add maid to ui list

* Specify licence
2024-02-20 12:00:23 +02:00
8dbbd75754 metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2215
2024-02-20 11:58:36 +02:00
c0a8c6db37 server : health endpoint configurable failure on no slot (#5594) b2214 2024-02-20 09:48:19 +02:00
b9111bd209 Update ggml_sycl_op_mul_mat_vec_q (#5502)
* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
b2213
2024-02-20 12:31:25 +05:30
633782b8d9 nix: now that we can do so, allow MacOS to build Vulkan binaries
Author:    Philip Taron <philip.taron@gmail.com>
Date:      Tue Feb 13 20:28:02 2024 +0000
b2212
2024-02-19 14:49:49 -08:00
22f83f0c38 Enable Vulkan MacOS CI 2024-02-19 14:49:49 -08:00
bb9dcd560a Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 2024-02-19 14:49:49 -08:00
f50db6ae0b Add check for VK_KHR_portability_enumeration for MoltenVK support 2024-02-19 14:49:49 -08:00
d8c054517d Add preprocessor checks for Apple devices.
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
2024-02-19 14:49:49 -08:00
42f664a382 Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
2024-02-19 14:49:49 -08:00
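A hedged sketch of the workaround these Vulkan-on-macOS commits describe (not the llama.cpp code): enabling VK_KHR_portability_enumeration and setting the matching instance flag lets vkCreateInstance see MoltenVK's portability driver instead of failing with VK_ERROR_INCOMPATIBLE_DRIVER.

```cpp
#include <vulkan/vulkan.h>

// Hedged example: create an instance with portability enumeration enabled.
// A real implementation should first check that the extension is available,
// as the refactored ggml_vk_instance_init() checks above suggest.
static VkResult create_instance_portable(VkInstance * instance) {
    const char * exts[] = { VK_KHR_PORTABILITY_ENUMERATION_EXTENSION_NAME };

    VkApplicationInfo app = {};
    app.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_2;

    VkInstanceCreateInfo info = {};
    info.sType                   = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.flags                   = VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR;
    info.pApplicationInfo        = &app;
    info.enabledExtensionCount   = 1;
    info.ppEnabledExtensionNames = exts;

    return vkCreateInstance(&info, nullptr, instance);
}
```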
5dde540897 Allow for Vulkan build with Accelerate.
Closes #5304
2024-02-19 14:49:49 -08:00
40c3a6c1e1 cuda : ignore peer access already enabled errors (#5597)
* cuda : ignore peer access already enabled errors

* fix hip
b2205
2024-02-19 23:40:26 +01:00
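A hedged sketch of the pattern above (not the actual llama.cpp/HIP code): cudaDeviceEnablePeerAccess reports cudaErrorPeerAccessAlreadyEnabled when access was already turned on, which is harmless and should be cleared rather than treated as a failure.

```cpp
#include <cuda_runtime.h>

// Enable peer access between two devices, ignoring the "already enabled" case.
static void enable_peer_access(int device, int peer) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, device, peer);
    if (!can_access) {
        return;
    }
    cudaSetDevice(device);
    const cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError(); // clear the sticky error state and continue
    } else if (err != cudaSuccess) {
        // a real error: handle or report it
    }
}
```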
f24ed14ee0 make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598) b2204 2024-02-19 15:54:12 -05:00
9d679f0fcc examples : support minItems/maxItems in JSON grammar converter (#5039)
* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2203
2024-02-19 16:14:07 +02:00
1387cf60f7 llava : remove extra cont (#5587) b2202 2024-02-19 15:23:17 +02:00
6fd413791a llava : replace ggml_cpy with ggml_cont b2201 2024-02-19 15:09:43 +02:00
337c9cbd52 sync : ggml
ggml-ci
2024-02-19 15:09:43 +02:00