Commit Graph

2432 Commits

7fe4678b02 llama : fix session save/load with quantized KV (#5649) b2232 2024-02-21 22:52:39 +01:00
ba2135ccae gemma : allow offloading the output tensor (#5646) b2231 2024-02-21 22:18:23 +01:00
89febfed93 examples : do not assume BOS when shifting context (#5622) b2230 2024-02-21 10:33:54 -05:00
5022cf242d sync : ggml 2024-02-21 16:52:52 +02:00
1ecea255eb server: health: fix race condition on slots data using tasks queue (#5634)
* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning: task.target_id not initialized
b2228
2024-02-21 15:47:48 +01:00
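For context on the fix above, here is a minimal, hypothetical sketch of the task-queue pattern it describes (not the actual server code; all names are invented): the HTTP handler never reads slot state directly, it enqueues a task that the main loop executes, so the data is only ever touched from one thread.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <vector>

struct slot { int id; bool busy; };

class server_like {
    std::vector<slot> slots;                   // only touched by the main loop
    std::deque<std::function<void()>> tasks;   // task queue shared with handlers
    std::mutex m;
    std::condition_variable cv;

public:
    // Called from an HTTP handler thread: enqueue a task and wait for its result
    // instead of reading `slots` directly (which would race with the main loop).
    size_t count_idle_slots() {
        std::promise<size_t> p;
        auto fut = p.get_future();
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_back([this, &p] {
                size_t idle = 0;
                for (const auto & s : slots) idle += s.busy ? 0 : 1;
                p.set_value(idle);
            });
        }
        cv.notify_one();
        return fut.get();                      // blocks until the main loop answers
    }

    // Main loop: the single place where slots (and the queued tasks) are processed.
    void run_once() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !tasks.empty(); });
        auto task = std::move(tasks.front());
        tasks.pop_front();
        lock.unlock();
        task();
    }
};
```

The actual server routes typed task objects through its existing queue; closures are used here only to keep the sketch short.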
a00a35cef9 readme : add LocalAI to the list of available UIs (#5629) 2024-02-21 16:39:10 +02:00
eccd7a26dd sync : ggml (#5633)
* ggml : fix conv_2d batch mode (ggml/737)

Co-authored-by: bssrdf <bssrdf@gmail.com>

* ggml : compute forward no longer passes src tensors (ggml/729)

* sync : ggml

ggml-ci

---------

Co-authored-by: bssrdf <merlintiger@hotmail.com>
Co-authored-by: bssrdf <bssrdf@gmail.com>
b2226
2024-02-21 16:17:10 +02:00
c14f72db9c readme : update hot topics 2024-02-21 15:39:54 +02:00
cc6cac08e3 llava : add --skip-unknown to 1.6 convert.py (#5632)
This commit adds the `--skip-unknown` option to the convert.py script
and removes the saving of the updated checkpoints, to avoid modifying
files that may be checked out.

The motivation for this change is that the same was done for 1.5
in commit fc0c8d286a ("llava :
update surgery script to not remove tensors"), and it makes the examples
more consistent.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-21 15:36:57 +02:00
580111d42b llama : add gemma model (#5631)
There are a couple of notable things in this architecture:

1. Shared input and output embedding parameters.
2. Key length and value length are not derived from `n_embd`.

More information about the models can be found at
https://ai.google.dev/gemma. GGUFs can be downloaded from
https://huggingface.co/google.
b2223
2024-02-21 15:08:22 +02:00
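To make the two Gemma points above concrete, here is a rough illustrative sketch (not llama.cpp code; the struct, field names, and numbers are hypothetical): the output projection reuses the token-embedding matrix, and the attention head size is read from model metadata rather than computed as n_embd / n_head.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model struct illustrating the two points above.
struct gemma_like_model {
    int n_vocab  = 256000;
    int n_embd   = 3072;                 // illustrative 7B-like shape
    int n_head   = 16;
    int head_dim = 256;                  // read from metadata, NOT n_embd / n_head (which would be 192)
    std::vector<float> tok_embd;         // n_vocab x n_embd token embedding matrix

    // Weight tying: output logits reuse the input embedding matrix, so no
    // separate output tensor needs to be stored in the model file.
    std::vector<float> logits(const std::vector<float> & hidden) const {
        std::vector<float> out(n_vocab, 0.0f);
        for (int v = 0; v < n_vocab; ++v) {
            float acc = 0.0f;
            for (int i = 0; i < n_embd; ++i) {
                acc += hidden[i] * tok_embd[(size_t) v * n_embd + i];
            }
            out[v] = acc;
        }
        return out;
    }
};
```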
88c46cbdac [SYCL] context: add name (#5624)
* [SYCL] context: add name

* name should start with SYCL*
b2222
2024-02-21 17:52:06 +08:00
a14679cc30 IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
* iq4_nl: squash commits for easier rebase

* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels

* iq4_nl: Fix after merging with master

* iq4_nl: another fix after merging with master

* Use IQ4_NL instead of Q4_K when using k-quants is not possible

* Fix typo that makes several tests fail

* It was the ggml_vdotq call that was missed inside the brackets

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b2221
2024-02-21 11:39:52 +02:00
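A simplified sketch of the IQ4_NL idea described above (illustrative only, not the actual ggml kernels; the codebook values are an assumption, not necessarily the exact kvalues_iq4nl table): each block of 32 weights stores one scale plus 32 four-bit indices into a small non-uniform lookup table, so dequantization is just value = d * codebook[q].

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative 16-entry non-linear codebook: denser near zero, wider at the tails.
static const int8_t kvalues_example[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

struct block_iq4nl_like {
    float   d;        // per-block scale (stored as fp16 in the real format)
    uint8_t qs[16];   // 32 x 4-bit codebook indices, packed two per byte
};

// Quantize one block of 32 floats: scale into codebook range, then pick the
// nearest codebook entry for each value.
static void quantize_block_32(const float * x, block_iq4nl_like * b) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    b->d = amax / 127.0f;
    const float id = b->d > 0.0f ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < 32; i += 2) {
        uint8_t packed = 0;
        for (int j = 0; j < 2; ++j) {
            const float v = x[i + j] * id;
            int best = 0;
            float best_err = 1e30f;
            for (int k = 0; k < 16; ++k) {
                const float err = std::fabs(v - (float) kvalues_example[k]);
                if (err < best_err) { best_err = err; best = k; }
            }
            packed = (uint8_t)(packed | (best << (4 * j)));
        }
        b->qs[i / 2] = packed;
    }
}

// Dequantization is the inverse: value = d * codebook[index].
static float dequantize_one(const block_iq4nl_like * b, int i) {
    const int q = (b->qs[i / 2] >> (4 * (i & 1))) & 0xF;
    return b->d * (float) kvalues_example[q];
}
```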
6560bed3f0 server : support llava 1.6 (#5553)
* server: init working 1.6

* move clip_image to header

* remove commented code

* remove c++ style from header

* remove todo

* expose llava_image_embed_make_with_clip_img

* fix zig build
b2220
2024-02-20 21:07:22 +02:00
06bf2cf8c4 make : fix debug build with CUDA (#5616) b2219 2024-02-20 20:06:17 +01:00
4ed8e4fbef llava : add explicit instructions for llava-1.6 (#5611)
This commit contains a suggestion for the README.md in the llava
example. The suggestion adds explicit instructions for how to convert
a llava-1.6 model and run it using llava-cli.

The motivation for this is that having explicit instructions similar to
the 1.5 instructions will make it easier for users to try this out.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-20 19:30:27 +02:00
9c405c9f9a Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2217
2024-02-20 15:58:27 +01:00
5207b3fbc5 readme : update UI list (#5605)
* Add maid to ui list

* Specify licence
2024-02-20 12:00:23 +02:00
8dbbd75754 metal : add build system support for embedded metal library (#5604)
* add build support for embedded metal library

* Update Makefile

---------

Co-authored-by: Haoxiang Fei <feihaoxiang@idea.edu.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2215
2024-02-20 11:58:36 +02:00
c0a8c6db37 server : health endpoint configurable failure on no slot (#5594) b2214 2024-02-20 09:48:19 +02:00
b9111bd209 Update ggml_sycl_op_mul_mat_vec_q (#5502)
* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
b2213
2024-02-20 12:31:25 +05:30
633782b8d9 nix: now that we can do so, allow MacOS to build Vulkan binaries
Author:    Philip Taron <philip.taron@gmail.com>
Date:      Tue Feb 13 20:28:02 2024 +0000
b2212
2024-02-19 14:49:49 -08:00
22f83f0c38 Enable Vulkan MacOS CI 2024-02-19 14:49:49 -08:00
bb9dcd560a Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 2024-02-19 14:49:49 -08:00
f50db6ae0b Add check for VK_KHR_portability_enumeration for MoltenVK support 2024-02-19 14:49:49 -08:00
d8c054517d Add preprocessor checks for Apple devices.
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
2024-02-19 14:49:49 -08:00
42f664a382 Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
2024-02-19 14:49:49 -08:00
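For readers unfamiliar with the MoltenVK issue referenced above, here is a hedged sketch of the usual remedy (not the exact ggml_vk_instance_init() code, and it assumes a Vulkan header recent enough to define the portability-enumeration symbols): if VK_KHR_portability_enumeration is available, request it and set the corresponding instance-create flag so the MoltenVK driver is not filtered out, which otherwise surfaces as VK_ERROR_INCOMPATIBLE_DRIVER.

```cpp
#include <vulkan/vulkan.h>
#include <cstring>
#include <vector>

VkInstance create_instance_with_portability() {
    uint32_t n = 0;
    vkEnumerateInstanceExtensionProperties(nullptr, &n, nullptr);
    std::vector<VkExtensionProperties> props(n);
    vkEnumerateInstanceExtensionProperties(nullptr, &n, props.data());

    bool has_portability = false;
    for (const auto & p : props) {
        if (std::strcmp(p.extensionName, VK_KHR_PORTABILITY_ENUMERATION_EXTENSION_NAME) == 0) {
            has_portability = true;
        }
    }

    std::vector<const char *> exts;
    VkInstanceCreateInfo ci = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    if (has_portability) {
        // Without this, MoltenVK (a non-conformant implementation) is hidden
        // and instance creation can fail with VK_ERROR_INCOMPATIBLE_DRIVER.
        exts.push_back(VK_KHR_PORTABILITY_ENUMERATION_EXTENSION_NAME);
        ci.flags |= VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR;
    }
    ci.enabledExtensionCount   = (uint32_t) exts.size();
    ci.ppEnabledExtensionNames = exts.data();

    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&ci, nullptr, &instance);   // error handling omitted for brevity
    return instance;
}
```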
5dde540897 Allow for Vulkan build with Accelerate.
Closes #5304
2024-02-19 14:49:49 -08:00
40c3a6c1e1 cuda : ignore peer access already enabled errors (#5597)
* cuda : ignore peer access already enabled errors

* fix hip
b2205
2024-02-19 23:40:26 +01:00
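The pattern behind the commit above, sketched with the CUDA runtime API (a simplified stand-alone version, not the ggml-cuda code): when enabling P2P access between devices, treat cudaErrorPeerAccessAlreadyEnabled as success and clear the sticky error state.

```cpp
#include <cuda_runtime.h>

// Enable P2P access from `device` to `peer`; "already enabled" is not a failure.
static bool enable_peer_access(int device, int peer) {
    cudaSetDevice(device);
    const cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();   // clear the sticky error state and continue
        return true;
    }
    return err == cudaSuccess;
}
```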
f24ed14ee0 make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598) b2204 2024-02-19 15:54:12 -05:00
9d679f0fcc examples : support minItems/maxItems in JSON grammar converter (#5039)
* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2203
2024-02-19 16:14:07 +02:00
1387cf60f7 llava : remove extra cont (#5587) b2202 2024-02-19 15:23:17 +02:00
6fd413791a llava : replace ggml_cpy with ggml_cont b2201 2024-02-19 15:09:43 +02:00
337c9cbd52 sync : ggml
ggml-ci
2024-02-19 15:09:43 +02:00
a3145bdc30 ggml-alloc : apply ggml/731 2024-02-19 15:09:43 +02:00
890559ab28 metal : option to embed MSL source into compiled binary (whisper/1842)
* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate the Metal library embedding assembly on the fly during the build process
2024-02-19 15:09:43 +02:00
d0e3ce51f4 ci : enable -Werror for CUDA builds (#5579)
* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci
b2197
2024-02-19 14:45:41 +02:00
68a6b98b3c make : fix CUDA build (#5580) b2196 2024-02-19 13:41:51 +02:00
70d45af0ef readme : fix typo in README-sycl.md (#5353) 2024-02-19 12:37:10 +02:00
13e2c771aa cmake : remove obsolete sycl compile flags (#5581)
* rm unwanted sycl compile options

* fix bug

* fix bug

* format fix
b2194
2024-02-19 11:15:18 +02:00
f53119cec4 minor : fix trailing whitespace (#5538) b2193 2024-02-19 10:34:10 +02:00
7084755396 llava : avoid changing the original BakLLaVA model (#5577)
This is a follow-up to commit fc0c8d286a
("llava : update surgery script to not remove tensors") but this time
the change is to the BakLLaVA specific part of the surgery script.

I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-19 10:31:59 +02:00
4480542b22 baby-llama : allocate graphs in ggml_context (#5573)
* Fixed the baby-llama issue (see issue #4830)

* minor : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2191
2024-02-19 10:25:38 +02:00
11b12de39b llama : add llama_chat_apply_template() (#5538)
* llama: add llama_chat_apply_template

* test-chat-template: remove redundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"
b2190
2024-02-19 10:23:37 +02:00
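A hedged usage sketch of the new llama_chat_apply_template() API (check llama.h for the exact signature in your revision; the call shape shown here is an assumption based on this commit): format a list of chat messages into a single prompt using the template stored in the model, retrying with a larger buffer if the first call reports that more space is needed.

```cpp
#include "llama.h"
#include <cstdint>
#include <string>
#include <vector>

// Assumed call shape: llama_chat_apply_template(model, tmpl, msgs, n_msg, add_ass, buf, len),
// returning the number of bytes required (which may exceed the buffer size).
std::string format_chat(const llama_model * model, const std::vector<llama_chat_message> & msgs) {
    std::vector<char> buf(1024);
    // tmpl == nullptr -> use the chat template stored in the model's metadata.
    int32_t res = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                            /*add_ass=*/true, buf.data(), (int32_t) buf.size());
    if (res > (int32_t) buf.size()) {   // buffer too small: retry with the reported size
        buf.resize(res);
        res = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                        true, buf.data(), (int32_t) buf.size());
    }
    return res > 0 ? std::string(buf.data(), res) : std::string();
}
```

A typical msgs value would be something like {{"system", "You are helpful."}, {"user", "Hi"}}, with add_ass=true appending the assistant prefix so generation can start right after the formatted prompt.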
3a9cb4ca64 cuda, metal : fix nans in soft_max (#5574)
* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2189
2024-02-19 10:04:45 +02:00
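Background for the soft_max fix above: the standard way to avoid NaNs and overflow in softmax is to subtract the row maximum before exponentiating. A scalar reference version for illustration (not the CUDA/Metal kernels):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax: subtracting the row max keeps exp() from
// overflowing and guarantees the denominator is at least 1.
static void soft_max_ref(std::vector<float> & x) {
    float max_val = -INFINITY;
    for (float v : x) max_val = std::max(max_val, v);
    double sum = 0.0;
    for (float & v : x) { v = std::exp(v - max_val); sum += v; }
    for (float & v : x) v = (float)(v / sum);
}
```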
769a716e30 readme : update (#5572)
Added 1.5-bit quantization to README.md
2024-02-19 09:39:31 +02:00
f0d1fafc02 ggml : android and old glibc NUMA incompatibility bugfixes (#5557)
* #ifdef out some NUMA code blocks for Android due to lack of support

* added __ANDROID__ #ifdef gates around NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu instead of the wrapper

* changed gates on NUMA platform-specific code to __gnu_linux__ to skip any platforms without glibc

* harmonized the #if defined blocks for NUMA code to __gnu_linux__, since that is the only platform model being followed anyway

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
b2187
2024-02-19 09:38:32 +02:00
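An illustrative stand-alone version of the glibc gating described above (not the ggml code; the helper name is hypothetical): compile the NUMA helper only on __gnu_linux__ and fall back to the raw getcpu syscall when glibc is older than 2.29, where the wrapper does not exist.

```cpp
#if defined(__gnu_linux__) && !defined(__ANDROID__)
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

static int current_cpu(void) {
    unsigned int cpu = 0, node = 0;
#if defined(__GLIBC__) && (__GLIBC__ < 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ < 29))
    syscall(SYS_getcpu, &cpu, &node, NULL);   // raw syscall: no wrapper before glibc 2.29
#else
    getcpu(&cpu, &node);                      // wrapper available from glibc 2.29 onwards
#endif
    return (int) cpu;
}
#else
static int current_cpu(void) { return 0; }    // NUMA paths disabled on non-glibc platforms
#endif // __gnu_linux__ && !__ANDROID__
```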
a0c2dad9d4 build : pass all warning flags to nvcc via -Xcompiler (#5570)
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
b2186
2024-02-18 16:21:52 -05:00
14278f55d2 ggml : restore vec dot stride arg names (#5453) b2185 2024-02-18 22:58:57 +02:00
b1de96824b ci : fix wikitext url + compile warnings (#5569)
ggml-ci
b2184
2024-02-18 22:39:30 +02:00
7ad554f90e metal : fix unused warnings (#0) 2024-02-18 21:39:58 +02:00