Commit Graph

3114 Commits

Author SHA1 Message Date
c0a8c6db37 server : health endpoint configurable failure on no slot (#5594) b2214 2024-02-20 09:48:19 +02:00
b9111bd209 Update ggml_sycl_op_mul_mat_vec_q (#5502)
* Update ggml_sycl_op_mul_mat_vec_q

* Apply suggestions from code review

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

* revert suggestion on macro

* fix bug

* Add quant type GGML_TYPE_IQ1_S to unsupported

* fix format

---------

Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
b2213
2024-02-20 12:31:25 +05:30
633782b8d9 nix: now that we can do so, allow MacOS to build Vulkan binaries
Author:    Philip Taron <philip.taron@gmail.com>
Date:      Tue Feb 13 20:28:02 2024 +0000
b2212
2024-02-19 14:49:49 -08:00
22f83f0c38 Enable Vulkan MacOS CI 2024-02-19 14:49:49 -08:00
bb9dcd560a Refactor validation and enumeration platform checks into functions to clean up ggml_vk_instance_init() 2024-02-19 14:49:49 -08:00
f50db6ae0b Add check for VK_KHR_portability_enumeration for MoltenVK support 2024-02-19 14:49:49 -08:00
d8c054517d Add preprocessor checks for Apple devices.
Based on work by @rbourgeat in https://github.com/ggerganov/llama.cpp/pull/5322/files
2024-02-19 14:49:49 -08:00
42f664a382 Resolve ErrorIncompatibleDriver with Vulkan on MacOS.
Refs:
- https://chat.openai.com/share/7020ce72-65fc-45ec-b7be-9d9d798a5f3f
- https://github.com/SaschaWillems/Vulkan/issues/954
- https://github.com/haasn/libplacebo/issues/128
- https://github.com/KhronosGroup/Vulkan-Samples/issues/476
2024-02-19 14:49:49 -08:00
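The portability-enumeration commits above boil down to one instance-creation detail: MoltenVK only shows up if the application opts into enumerating "portability" (non-fully-conformant) implementations. A hedged sketch of that pattern, with an illustrative helper name rather than the actual ggml_vk_instance_init() code:

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <vector>

// Hypothetical helper (not the actual ggml code): create a Vulkan instance that
// can see MoltenVK. Without the extension and flag below, vkCreateInstance on
// macOS fails with VK_ERROR_INCOMPATIBLE_DRIVER.
VkInstance create_instance_with_portability() {
    std::vector<const char *> extensions = {
        VK_KHR_PORTABILITY_ENUMERATION_EXTENSION_NAME,
    };

    VkInstanceCreateInfo info = {};
    info.sType                   = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.flags                   = VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR;
    info.enabledExtensionCount   = (uint32_t) extensions.size();
    info.ppEnabledExtensionNames = extensions.data();

    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&info, nullptr, &instance); // check the VkResult in real code
    return instance;
}
```
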
5dde540897 Allow for Vulkan build with Accelerate.
Closes #5304
2024-02-19 14:49:49 -08:00
40c3a6c1e1 cuda : ignore peer access already enabled errors (#5597)
* cuda : ignore peer access already enabled errors

* fix hip
b2205
2024-02-19 23:40:26 +01:00
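Peer-access setup between GPUs is now treated as idempotent: cudaErrorPeerAccessAlreadyEnabled is swallowed instead of aborting. A minimal sketch of the pattern, with an illustrative helper name rather than the actual llama.cpp code:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper, not the actual llama.cpp code: enable peer access from
// `dev` to `peer`, treating "already enabled" as success rather than an error.
static bool enable_peer_access(int dev, int peer) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dev, peer);
    if (!can_access) {
        return false;
    }

    cudaSetDevice(dev);
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError(); // clear the sticky error state and carry on
        return true;
    }
    return err == cudaSuccess;
}
```
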
f24ed14ee0 make : pass CPPFLAGS directly to nvcc, not via -Xcompiler (#5598) b2204 2024-02-19 15:54:12 -05:00
9d679f0fcc examples : support minItems/maxItems in JSON grammar converter (#5039)
* support minLength and maxLength in JSON schema grammar converter

* Update examples/json-schema-to-grammar.py

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2203
2024-02-19 16:14:07 +02:00
1387cf60f7 llava : remove extra cont (#5587) b2202 2024-02-19 15:23:17 +02:00
6fd413791a llava : replace ggml_cpy with ggml_cont b2201 2024-02-19 15:09:43 +02:00
337c9cbd52 sync : ggml
ggml-ci
2024-02-19 15:09:43 +02:00
a3145bdc30 ggml-alloc : apply ggml/731 2024-02-19 15:09:43 +02:00
890559ab28 metal : option to embed MSL source into compiled binary (whisper/1842)
* ggml : embed Metal library source (ggml-metal.metal) into binary

enable by setting WHISPER_EMBED_METAL_LIBRARY

* rename the build option

* rename the preprocessor directive

* generate Metal library embedding assembly on the fly during the build process
2024-02-19 15:09:43 +02:00
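The embedding approach is the usual .incbin trick: a small assembly stub generated during the build pulls the .metal source into the binary between two symbols, so the Metal library can be compiled from memory instead of being loaded from a file on disk. A rough illustration of the idea, with made-up symbol names rather than the assembly ggml actually generates:

```cpp
// Hypothetical illustration of the .incbin technique; symbol names are made up and
// the real build generates a separate assembly file rather than using inline asm.
__asm__(
    ".section __DATA,__const\n"
    ".globl _embedded_metal_start\n"
    "_embedded_metal_start:\n"
    ".incbin \"ggml-metal.metal\"\n"
    ".globl _embedded_metal_end\n"
    "_embedded_metal_end:\n"
);

extern "C" const char embedded_metal_start[];
extern "C" const char embedded_metal_end[];

// At run time the Metal source is available as an in-memory string:
//   std::string src(embedded_metal_start, embedded_metal_end - embedded_metal_start);
// which can be compiled with newLibraryWithSource:options:error: instead of
// reading ggml-metal.metal from disk.
```
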
d0e3ce51f4 ci : enable -Werror for CUDA builds (#5579)
* cmake : pass -Werror through -Xcompiler

ggml-ci

* make, cmake : enable CUDA errors on warnings

ggml-ci
b2197
2024-02-19 14:45:41 +02:00
68a6b98b3c make : fix CUDA build (#5580) b2196 2024-02-19 13:41:51 +02:00
70d45af0ef readme : fix typo in README-sycl.md (#5353) 2024-02-19 12:37:10 +02:00
13e2c771aa cmake : remove obsolete sycl compile flags (#5581)
* rm unwanted sycl compile options

* fix bug

* fix bug

* format fix
b2194
2024-02-19 11:15:18 +02:00
f53119cec4 minor : fix trailing whitespace (#5538) b2193 2024-02-19 10:34:10 +02:00
7084755396 llava : avoid changing the original BakLLaVA model (#5577)
This is a follow-up of commit fc0c8d286a
("llava : update surgery script to not remove tensors") but this time
the change is to the BakLLaVA-specific part of the surgery script.

I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
as expected using the instructions in README.md.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-02-19 10:31:59 +02:00
4480542b22 baby-llama : allocate graphs in ggml_context (#5573)
* Fixed the baby-llama issue (see issue #4830)

* minor : fix whitespaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2191
2024-02-19 10:25:38 +02:00
11b12de39b llama : add llama_chat_apply_template() (#5538)
* llama: add llama_chat_apply_template

* test-chat-template: remove redundant vector

* chat_template: do not use std::string for buffer

* add clarification for llama_chat_apply_template

* llama_chat_apply_template: add zephyr template

* llama_chat_apply_template: correct docs

* llama_chat_apply_template: use term "chat" everywhere

* llama_chat_apply_template: change variable name to "tmpl"
b2190
2024-02-19 10:23:37 +02:00
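llama_chat_apply_template() formats a list of role/content messages using either a caller-supplied template string or the chat template stored in the model's metadata. A usage sketch against the signature introduced here (buffer handling simplified; llama.h is authoritative and may have changed since):

```cpp
#include "llama.h"

#include <string>
#include <vector>

// Hedged usage sketch; error handling and buffer sizing are simplified.
std::string render_chat(const llama_model * model) {
    std::vector<llama_chat_message> chat = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };

    std::vector<char> buf(4096);

    // Passing nullptr for tmpl selects the chat template embedded in the model's
    // GGUF metadata; add_ass appends the assistant prefix for generation.
    int32_t n = llama_chat_apply_template(model, /*tmpl=*/nullptr,
                                          chat.data(), chat.size(),
                                          /*add_ass=*/true,
                                          buf.data(), buf.size());
    if (n < 0) {
        return {}; // template not recognized / not supported
    }
    if ((size_t) n > buf.size()) {
        buf.resize(n); // output did not fit: grow the buffer and render again
        n = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                      true, buf.data(), buf.size());
    }
    return std::string(buf.data(), n);
}
```
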
3a9cb4ca64 cuda, metal : fix nans in soft_max (#5574)
* cuda : fix nans in soft_max

* metal : fix nans in soft_max

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b2189
2024-02-19 10:04:45 +02:00
769a716e30 readme : update (#5572)
Added 1.5-bit to README.md
2024-02-19 09:39:31 +02:00
f0d1fafc02 ggml : android and old glibc NUMA incompatibility bugfixes (#5557)
* #ifdef out some NUMA code blocks for Android due to lack of support

* added __ANDROID__ #ifdef gates around the NUMA code and forced glibc prior to 2.29 to use a syscall for getcpu instead of the wrapper

* changed gates on NUMA platform-specific code to __gnu_linux__ to skip any platforms without glibc

* harmonized #if defined blocks for NUMA code to __gnu_linux__, since that is the only model being followed anyway

---------

Co-authored-by: root <root@nenya.lothlorien.ca>
b2187
2024-02-19 09:38:32 +02:00
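The gating described above keeps the NUMA code off Android and off non-glibc platforms, and avoids the getcpu() wrapper that only exists from glibc 2.29 onward. A hypothetical sketch of that shape, not the actual ggml code:

```cpp
// NUMA helpers are compiled only on glibc-based Linux, and getcpu falls back to
// the raw syscall when the glibc wrapper (added in 2.29) is unavailable.
#if defined(__gnu_linux__)

#include <sched.h>        // getcpu() wrapper (glibc >= 2.29, _GNU_SOURCE)
#include <unistd.h>       // syscall()
#include <sys/syscall.h>  // SYS_getcpu

static unsigned int current_cpu() {
    unsigned int cpu = 0, node = 0;
#if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 29))
    getcpu(&cpu, &node);                       // wrapper is available
#else
    syscall(SYS_getcpu, &cpu, &node, nullptr); // older glibc: raw syscall
#endif
    return cpu;
}

#endif // __gnu_linux__
```
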
a0c2dad9d4 build : pass all warning flags to nvcc via -Xcompiler (#5570)
* build : pass all warning flags to nvcc via -Xcompiler
* make : fix apparent mis-merge from #3952
* make : fix incorrect GF_CC_VER for CUDA host compiler
b2186
2024-02-18 16:21:52 -05:00
14278f55d2 ggml : restore vec dot stride arg names (#5453) b2185 2024-02-18 22:58:57 +02:00
b1de96824b ci : fix wikitext url + compile warnings (#5569)
ggml-ci
b2184
2024-02-18 22:39:30 +02:00
7ad554f90e metal : fix unused warnings (#0) 2024-02-18 21:39:58 +02:00
5ee99c32f5 common, server : surface min_keep as its own parameter (#5567)
* Feature - surface min_keep as its own parameter

* Updated README with min_keep param
b2182
2024-02-18 21:11:16 +02:00
c145f8a132 server : slots monitoring endpoint (#5550) b2181 2024-02-18 19:39:57 +02:00
689a091bbe sampling : do not set min_keep to n_probs (#5564) b2180 2024-02-18 19:38:06 +02:00
f3f28c5395 cmake : fix GGML_USE_SYCL typo (#5555) b2179 2024-02-18 19:17:00 +02:00
e75c6279d1 server : enhanced health endpoint (#5548)
* server: enrich health endpoint with available slots, return 503 if no slots are available

* server: document new status no slot available in the README.md
b2178
2024-02-18 18:31:28 +02:00
36376abe05 server : --n-predict option document and cap to max value (#5549)
* server: document --n-predict

* server: ensure client request cannot override n_predict if set

* server: fix print usage LF in new --n-predict option
b2177
2024-02-18 18:30:09 +02:00
66c1968f7a server : graceful server shutdown (#5244)
This updates the server queue to support graceful shutdown of the server on signals.
b2176
2024-02-18 18:23:16 +02:00
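The usual shape of such a shutdown path: the signal handler only records that shutdown was requested, and the queue loop drains outstanding work before returning instead of being killed mid-request. A generic sketch under that assumption, not the server's actual code:

```cpp
#include <atomic>
#include <csignal>

// Generic sketch: the handler only sets a flag; the main loop notices it, stops
// taking new work, and exits cleanly so in-flight requests are not cut off.
static std::atomic<bool> g_shutdown{false};

static void on_signal(int) {
    g_shutdown.store(true);
}

int main() {
    std::signal(SIGINT,  on_signal);
    std::signal(SIGTERM, on_signal);

    while (!g_shutdown.load()) {
        // accept and process one queued task / HTTP request here
    }

    // shutdown path: finish pending responses, release resources, exit normally
    return 0;
}
```
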
1dcc3fde00 common : fix ub (#5530) b2175 2024-02-18 18:21:52 +02:00
5d3de51f97 ggml, common, examples, tests : fixed type arguments in printf (#5528) b2174 2024-02-18 18:20:12 +02:00
fc0c8d286a llava : update surgery script to not remove tensors (#5536)
This commit updates the surgery script to not remove the tensors from the
model file. For this to work the `--skip-unknown` flag is added as an
argument to the convert.py script in README.md.

The motivation for this change is that the surgery script currently
removes the projector tensors from the model file. If the model was
checked out from a repository, the model file will have been updated
and will have to be checked out again to reset this effect. If this can be
avoided, I think it would be preferable.

I did not perform this change for BakLLaVA models as I am not sure
how that part works.
2024-02-18 18:19:23 +02:00
bd2d4e393b 1.5 bit quantization (#5453)
* iq1_s: WIP basics

* iq1_s: CUDA is working

* iq1_s: scalar CPU dot product

* iq1_s: WIP AVX2 dot product - something is not right

* Fix tests

* Fix shadow warnings

* Fix after merge with latest master

* iq1_s: AVX2 finally works

* iq1_s: ARM_NEON dot product. Works, but not very fast

* iq1_s: better grid

* iq1_s: use IQ2_XXS for attn_output

At a cost of 0.04 extra bpw this gives a big improvement in PPL.

* iq1_s: Metal basics

Dequantize works, but not dot product

* iq1_s: Metal works, but quite slow

As usual, Apple Silicon does not like the code I write.

* iq1_s: Tests

* iq1_s: slightly faster dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
b2172
2024-02-18 18:16:55 +02:00
c8e0d7efeb flake.lock: Update
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
  → 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
2024-02-18 06:39:58 -08:00
8f1be0d42f ggml : add ALiBi support for ggml_soft_max_ext (#5488)
* ggml : avoid recomputing alibi slopes (CPU)

* llama : reuse hparams.f_max_alibi_bias in all cases

ggml-ci

* ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)

ggml-ci

* ggml : handle all SRCs (do not break on first null)

ggml-ci

* tests : do not use slope for large soft_max

accumulates too much error

ggml-ci

* ggml : alternative ALiBi without extra tensor

We compute the slopes in the kernel

ggml-ci

* cuda : add ALiBi support in ggml_soft_max_ext

ggml-ci

* ggml : deprecate ggml_alibi

* ggml : support multi-sequence ALiBi (Metal)

ggml-ci

* cuda : add multi-seq ALiBi + remove F16 soft_max

ggml-ci

* ggml : update deprecation message

* ggml : fix pos ptr when no ALiBi

ggml-ci

* cuda : fix performance (pow -> powf)

* cuda : precompute ALiBi constants

* metal : pre-compute ALiBi slopes

ggml-ci

* llama : init kq_pos only if needed

ggml-ci

* test-backend-ops : add null pos test to soft_max

test-backend-ops : replace soft_max tests

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-17 23:04:16 +02:00
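"We compute the slopes in the kernel" refers to deriving each head's ALiBi slope from max_bias and the head count on the fly, so no extra slope tensor has to be passed to ggml_soft_max_ext. A hedged sketch of the standard ALiBi slope formula; the exact ggml implementation may differ in details:

```cpp
#include <cmath>

// Standard ALiBi slope for a given head, computed from max_bias and the number of
// heads. The first n_head_log2 heads use powers of m0, the rest odd powers of m1.
static float alibi_slope(int head, int n_head, float max_bias) {
    const int n_head_log2 = 1 << (int) floorf(log2f((float) n_head));

    const float m0 = powf(2.0f, -max_bias          / n_head_log2);
    const float m1 = powf(2.0f, -(max_bias / 2.0f) / n_head_log2);

    if (head < n_head_log2) {
        return powf(m0, (float) (head + 1));
    }
    return powf(m1, (float) (2*(head - n_head_log2) + 1));
}
```
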
6e4e973b26 ci : add an option to fail on compile warning (#3952)
* feat(ci): add an option to fail on compile warning

* Update CMakeLists.txt

* minor : fix compile warnings

ggml-ci

* ggml : fix unreachable code warnings

ggml-ci

* ci : disable fatal warnings for windows, ios and tvos

* ggml : fix strncpy warning

* ci : disable fatal warnings for MPI build

* ci : add fatal warnings to ggml-ci

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-17 23:03:14 +02:00
d250c9d61d gitignore : update for CLion IDE (#5544) b2168 2024-02-17 18:28:37 +02:00
5bf2b94dd4 cmake : fix VULKAN and ROCm builds (#5525)
* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci
b2167
2024-02-16 19:05:56 +02:00
d2819d5577 scripts : add helpers script for bench comparing commits (#5521)
* scripts : add helpers script for bench comparing commits

* scripts : detect CUDA

* set flags after checking the command line

* fix make flags

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-02-16 15:14:40 +02:00
4cb0727698 llava : removed excess free(NULL) operation (#5531) 2024-02-16 14:43:23 +02:00