Commit Graph

3459 Commits

Author SHA1 Message Date
95f57bb5d5 ggml : remove ggml_task_type and GGML_PERF (#8017)
* ggml : remove ggml_task_type and GGML_PERF

* check abort_callback on main thread only

* vulkan : remove usage of ggml_compute_params

* remove LLAMA_PERF
b3209
2024-06-24 03:07:59 +02:00
e112b610a1 llama : add support for BitnetForCausalLM (#7931)
* hf bitnet v1

* hf bitnet e2e v2

* finish bitnet e2e

* finish f16 hf bitnet e2e

* remove unused

* finish bitnet i2 e2e

* move i2s to quantize v1

* move i2 to quantize

* clean code

* clean code 2

* fix codestyle

* fix code

* fix

* fix code

* fix merge

* remove unused

* change table name

* fix whitespace

* delete redundant

* i2_s to absmax

* finish i2_s/i8_s vec_dot x86 simd

* i2s->q22

* fix code

* remove block scale

* add dequantize

* fix seq

* update avx2

* remove q2_2

* remove q22_grid

* fix whitespace

* reuse llm_build_kv

* fix bo

---------

Co-authored-by: root <root@wangjinheng>
b3208
2024-06-23 21:27:57 +03:00
6a2f298bd7 server : fix JSON-Schema typo (#7975) 2024-06-23 11:03:08 -04:00
11318d9aa1 Fix typo in llama_set_embeddings comment (#8077) b3206 2024-06-23 15:39:45 +02:00
b6b9a8e606 fix CI failures (#8066)
* test-backend-ops : increase cpy max nmse

* server ci : disable thread sanitizer
b3205
2024-06-23 13:14:45 +02:00
45c0e2e4c1 Refactor Vulkan backend to allow multiple contexts (#7961)
* Refactor Vulkan backend to allow multiple contexts

* Fix "too many shader groups called" validation error in llama3 on AMD and Intel GPUs

* Fix Vulkan debug build error
b3204
2024-06-23 10:21:25 +02:00
b5a5f34efa Removing extra blank lines that were breaking Lint. (#8067) b3203 2024-06-22 14:28:18 -04:00
3e58b0ee35 cvector: fix CI + correct help message (#8064)
* cvector: fix CI + correct help message

* also correct --pca-iter
b3202
2024-06-22 18:11:30 +02:00
adf480c3ab cvector-generator: Moe Moe Fixie-Fixie for Lots of Formats~! ♡(ᐢ ᴥ ᐢ)♡ (#8052)
* Update negative.txt

* Update positive.txt

* Update cvector-generator.cpp

* Update cvector-generator.cpp
b3201
2024-06-22 17:19:37 +02:00
3aa184a8c7 convert-hf : change assert to exception (#8015) 2024-06-22 15:37:41 +02:00
5b48cd53a8 Update llama-quantize ppl/file size output from LLaMA-v1 to Llama-3 values (#8058)
Uses the values computed by @JohannesGaessler in PR #7413
b3199
2024-06-22 15:16:10 +02:00
c5a8d4b749 JSON Schema to GBNF integration tests (#7790)
* Adding a simple bare-bones end-to-end integration test for JSON validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
2024-06-21 23:18:36 -04:00
557b653dc9 vulkan: detect multiple devices by deviceUUID instead of deviceID (#8022)
* vulkan: detect multiple devices by deviceUUID instead of deviceID

* vulkan: remove unneeded variables

* vulkan: fix id query
b3197
2024-06-21 10:28:20 +02:00
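
As a rough illustration of the deviceUUID query the commit above switches to: a minimal sketch against the standard Vulkan 1.1 API, not code from the PR, and `get_device_uuid` is a hypothetical helper name.

```cpp
#include <vulkan/vulkan.h>

#include <array>
#include <cstdint>
#include <cstring>

// Query the 16-byte deviceUUID of a physical device (Vulkan 1.1+).
// deviceUUID identifies a specific physical device, while deviceID only
// identifies the GPU model, which is why it is better for de-duplication.
static std::array<uint8_t, VK_UUID_SIZE> get_device_uuid(VkPhysicalDevice dev) {
    VkPhysicalDeviceIDProperties id_props = {};
    id_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &id_props;

    vkGetPhysicalDeviceProperties2(dev, &props2);

    std::array<uint8_t, VK_UUID_SIZE> uuid;
    std::memcpy(uuid.data(), id_props.deviceUUID, VK_UUID_SIZE);
    return uuid;
}
```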
Eve 7d5e8777ae ggml : AVX IQ quants (#7845)
* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply
b3196
2024-06-21 08:57:36 +03:00
a927b0f3dd llama : optimize long word tokenization with WPM (#8034)
ggml-ci
b3195
2024-06-21 08:51:28 +03:00
80ea089d77 llama : allow pooled embeddings on any model (#7477)
* create append_pooling operation; allow specifying attention_type; add last-token pooling; update examples

* find result_norm/result_embd tensors properly; update output allocation logic

* only use embd output for pooling_type NONE

* get rid of old causal_attn accessor

* take out attention_type; add in llama_set_embeddings

* bypass logits when doing non-NONE pooling
b3194
2024-06-21 08:38:22 +03:00
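
A hedged usage sketch for the API touched above, assuming the `llama_set_embeddings()` and `llama_get_embeddings_seq()` entry points from `llama.h`; `print_pooled_embedding` is a hypothetical helper, not code from the PR.

```cpp
#include "llama.h"

#include <cstdio>

// Minimal sketch: turn on embedding output at run time, decode a batch,
// then read the pooled embedding of sequence 0.
static bool print_pooled_embedding(llama_context * ctx, llama_batch batch) {
    llama_set_embeddings(ctx, true);          // request embeddings for this decode

    if (llama_decode(ctx, batch) != 0) {
        return false;                         // decode failed
    }

    // pooled embedding for sequence 0; nullptr when pooling_type is NONE
    const float * embd = llama_get_embeddings_seq(ctx, 0);
    if (embd == nullptr) {
        return false;
    }

    const int n_embd = llama_n_embd(llama_get_model(ctx));
    for (int i = 0; i < n_embd; ++i) {
        printf("%s%.6f", i == 0 ? "" : " ", embd[i]);
    }
    printf("\n");
    return true;
}
```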
0e64591e82 swiftui : enable stream updating (#7754) b3193 2024-06-21 08:30:58 +03:00
b1ef562bc1 requirements : Bump torch and numpy for python3.12 (#8041) 2024-06-20 22:01:15 +02:00
17b291a6a5 convert-hf : Fix the encoding in the convert-hf-to-gguf-update.py (#8040) 2024-06-20 21:59:59 +02:00
abd894ad96 common: fix warning (#8036)
* common: fix warning

* Update common/common.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
b3190
2024-06-20 16:40:13 +02:00
de391e4c80 [SYCL] Fix windows build and inference (#8003)
* add sycl preset

* fix debug link error. fix windows crash

* update README
b3189
2024-06-20 21:19:05 +08:00
d50f8897a7 CUDA: stream-k decomposition for MMQ (#8018)
* CUDA: stream-k decomposition for MMQ

* fix undefined memory reads for small matrices
b3188
2024-06-20 14:39:21 +02:00
2075a66a96 metal : fix ggml_metal_supports_op for BF16 (#8021)
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether the first few source types are BF16 and returns false if that's the case.
b3187
2024-06-20 08:32:01 +03:00
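
A minimal sketch of the kind of guard described above, assuming the ggml convention that an op's inputs live in `src[]` and that `GGML_TYPE_BF16` marks bfloat16 tensors; illustrative only, not the actual ggml-metal change.

```cpp
#include "ggml.h"

// If any of the first few sources of an op is BF16, report the op as
// unsupported so the scheduler falls back to a backend that can run it,
// instead of crashing at dispatch time.
static bool backend_supports_op(const struct ggml_tensor * op) {
    for (int i = 0; i < 3; ++i) {
        if (op->src[i] != nullptr && op->src[i]->type == GGML_TYPE_BF16) {
            return false;
        }
    }
    // ... the usual per-op support checks would follow here ...
    return true;
}
```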
ba58993152 server : fix smart slot selection (#8020) b3186 2024-06-20 09:57:10 +10:00
a7854743c5 un-ignore build-info.cmake and build-info.sh (#7996)
* un-ignore `build-info.cmake` and `build-info.sh`

I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will treat the files as nonexistent for the purpose of publishing, even if they are committed. This leads to the build failing in such cases.

* un-ignore `build-info.cpp.in`

For the same reason as the previous two files.

* Reorganize `.gitignore`

* Add exceptions for files mentioned by @slaren

I did leave .clang-tidy since it was explicitly ignored before.

* Add comments for organization
* Sort some lines for pretty
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed

* Remove `.clang-tidy` from `.gitignore`

Per comment by @ggerganov

* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
2024-06-19 22:10:42 +02:00
9c77ec1d74 ggml : synchronize threads using barriers (#7993) b3184 2024-06-19 15:04:15 +02:00
a04a953cab codecov : remove (#8004) b3183 2024-06-19 13:04:36 +03:00
623494a478 [SYCL] refactor (#6408)
* separate lower-precision GEMM from the main files

* fix workgroup size hardcode
b3182
2024-06-19 09:11:51 +08:00
37bef89433 tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
b3181
2024-06-18 18:40:52 +02:00
91c188d6c2 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists
b3180
2024-06-18 22:19:45 +10:00
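
In the spirit of the fix above, a hedged sketch of guarding a special-token lookup; `llama_token_middle()` and the `-1` "no such token" sentinel are assumptions about the `llama.h` API of this period, not taken from the PR.

```cpp
#include "llama.h"

#include <vector>

// Append the FIM middle token only when the model's vocab actually defines it;
// the -1 sentinel is an assumption about the API of this era.
static void maybe_append_fim_middle(const llama_model * model, std::vector<llama_token> & tokens) {
    const llama_token middle = llama_token_middle(model);
    if (middle != -1) {
        tokens.push_back(middle);
    }
}
```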
84f6de17f6 Fix no gcc pragma on Windows (#7751) b3179 2024-06-18 22:18:32 +10:00
61665277af Allow compiling with CUDA without CUDA runtime installed (#7989)
On hosts that are not prepared/dedicated to execute CUDA code, it is still possible to compile llama.cpp with CUDA support by installing just the development packages. The runtime libraries such as /usr/lib64/libcuda.so* are missing, however, and the link step currently fails.

The development environment is prepared for such situations: stub libraries for all the CUDA libraries are available in the $(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end of the library search path changes nothing for environments that already work, but it enables compiling llama.cpp even when the runtime libraries are not available.
b3178
2024-06-18 14:00:14 +02:00
b96f9afb0d chore: clean useless beam search param (#7985)
Signed-off-by: thxCode <thxcode0824@gmail.com>
b3177
2024-06-18 10:11:40 +03:00
1193778105 readme : update UI list (#7943) 2024-06-18 09:57:41 +03:00
5326bcceeb ggml : sync b3175 2024-06-18 09:50:45 +03:00
e6ecc2be47 whisper : use ggml_backend_sched (whisper/2239)
* whisper : use ggml_backend_sched (wip)

* use sched in whisper_allocr

* whisper : single backend in whisper_context

* whisper : remove whisper_state->backends_used

* whisper : remove whisper_context->backend

* whisper : reset scheduler after init

* whisper : fix external encoder (e.g. CoreML)

* whisper : cleanup

* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-18 09:50:40 +03:00
a94e6ff877 update: support Qwen2-57B-A14B (#7835)
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B

* fix: QWEN2MOE support for expert_feed_forward_length

previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH

n_ff_exp and n_ff_shared_exp are now properly calculated

* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM
b3173
2024-06-17 21:08:46 +02:00
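
For context on where such a key would be read back, a rough sketch using the public gguf reader API; the exact key string `qwen2moe.expert_feed_forward_length` and the fallback behaviour are illustrative assumptions, not taken from the PR.

```cpp
#include "ggml.h"   // the gguf_* reader API of this era

#include <cstdint>

// Read the dedicated per-expert feed-forward length from a converted GGUF,
// falling back to a caller-provided n_ff when the key is absent (i.e. files
// produced before this change). Key name is an assumption for illustration.
static uint32_t read_expert_ff_length(const struct gguf_context * ctx, uint32_t n_ff_fallback) {
    const int key_id = gguf_find_key(ctx, "qwen2moe.expert_feed_forward_length");
    if (key_id < 0) {
        return n_ff_fallback;
    }
    return gguf_get_val_u32(ctx, key_id);
}
```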
5b6da18750 Make updates to type cast based on compiler instead of OS (#7851) b3172 2024-06-17 20:23:17 +02:00
7c26775adb llama : disable FA if KV head sizes do not match (#7982) b3171 2024-06-17 19:40:01 +03:00
b473e95084 Add Nix and Flox install instructions (#7899) 2024-06-17 09:37:55 -06:00
99052cd227 sched : offload_op also requires supports_op (#7977) b3169 2024-06-17 16:51:42 +02:00
c637fcd34d fix: divide-by-zero exception in mamba (#7932)
Signed-off-by: thxCode <thxcode0824@gmail.com>
b3168
2024-06-17 16:11:08 +02:00
6a2f0b3474 Implement non-mapped async IO for CUDA on Windows. (#7896)
* Implement non-mapped async IO for CUDA on Windows. On a fast Gen5 NVMe drive this change improves model load time by >3x while it should be the same (or slightly faster) on any other drive.

* Free resources except for backend.

* Change assertions to exceptions in llama_file, find correct cuda backend to create CUDA resources and respect the use_mmap flag again for CUDA.

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* Fix editorconfig and unused variable

* Fix issues with Windows build

---------

Co-authored-by: slaren <slarengh@gmail.com>
b3167
2024-06-17 16:10:15 +02:00
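
The general pattern behind the load-time improvement described above, sketched with the plain CUDA runtime API (pinned staging buffer plus `cudaMemcpyAsync`); an illustration of the technique only, not the PR's Windows implementation, and the chunk size is arbitrary.

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <cstdio>

// Stream a file into device memory through a pinned (page-locked) staging
// buffer so host->device copies can be issued asynchronously, instead of
// going through a memory-mapped file. A real implementation would
// double-buffer to overlap the file reads with the copies.
static bool load_file_to_device(const char * path, void * dst_dev, size_t size, cudaStream_t stream) {
    std::FILE * f = std::fopen(path, "rb");
    if (f == nullptr) {
        return false;
    }

    const size_t chunk_size = 16u * 1024 * 1024; // arbitrary 16 MiB staging chunk
    void * staging = nullptr;
    if (cudaMallocHost(&staging, chunk_size) != cudaSuccess) {
        std::fclose(f);
        return false;
    }

    size_t off = 0;
    while (off < size) {
        const size_t want = chunk_size < size - off ? chunk_size : size - off;
        const size_t n    = std::fread(staging, 1, want, f);
        if (n == 0) {
            break;
        }
        cudaMemcpyAsync((char *) dst_dev + off, staging, n, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream); // staging buffer is reused, so wait before the next read
        off += n;
    }

    cudaFreeHost(staging);
    std::fclose(f);
    return off == size;
}
```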
21be9cab94 rpc : fix load/store misaligned addresses (#7948) b3166 2024-06-17 11:09:20 +03:00
006167aaf6 gguf-dump.py: add --markdown dump output (#7853)
* gguf-dump.py: add --markdown dump output

* gguf-dump.py: Add toc

* gguf-dump.py: use standard tensor name lookup. Also add tensor ID field

* gguf-dump.py: Add tensor overview count

* gguf-dump.py: fix array preview

* gguf-dump.py: markdownTableWithAlignmentSupport() added

* Add type hints and spacing

Co-authored-by: compilade <git@compilade.net>

* gguf-dump.py: prettify dimension

* gguf-dump: right align element count

* gguf-dump.py: element count autosizing

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
2024-06-17 15:25:20 +10:00
df68d4fa5d [SYCL] Update README-sycl.md for Chapter "Recommended release" and "News" (#7946)
* Update README-sycl.md

* Update README-sycl.md

* Update README-sycl.md

* Update README-sycl.md
b3164
2024-06-17 11:17:07 +08:00
43b35e38ba Add support for sqrt on CUDA (#7953)
* cuda sqrt support

* enable cuda in pca

* fix comments in pca

* add test

* add sqrt to ggml_backend_cuda_supports_op

* fix test

* new line

* Use F32 sqrtf instead of F64 sqrt

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b3163
2024-06-17 00:23:04 +02:00
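
A tiny CPU reference for the elementwise op being added, reflecting the final bullet's switch to the F32 `sqrtf` (illustrative only; the actual change is a CUDA kernel plus a `ggml_backend_cuda_supports_op` entry).

```cpp
#include <math.h>
#include <stdint.h>

// Reference semantics of the elementwise sqrt op on F32 data: keep the
// computation in single precision (sqrtf) rather than promoting to double.
static void sqrt_f32(const float * src, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        dst[i] = sqrtf(src[i]);
    }
}
```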
19b7a836f6 cuda : fix bounds check for src0 rows in MMVQ kernel (whisper/2231)
* cuda : fix bounds check for src0 rows in MMVQ kernel

* Update ggml-cuda/mmvq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b3162
2024-06-16 20:32:49 +03:00
b5fcf8ef5c ggml : fix and optimize ppc64le (ggml/849)
* fix compile issues introduced by loongarch_asx

* restore quant changes to merge

* fix compile issues introduced by loongarch_asx

* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16 20:32:49 +03:00
398105ff43 ggml : remove duplicate include of ggml-common.h (ggml/853)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16 20:32:49 +03:00