Commit Graph

5433 Commits

SHA1 Message Date
759e37b0d8 tests : avoid github urls due to throttling (#13654) 2025-05-20 12:03:17 +02:00
4245e622e0 sycl: disable reorder for sycl mulmat (#13536) b5432 2025-05-20 11:34:15 +02:00
c9c64dee57 Set GLM4 blk.*.attn_output.weight, kqv_out-* matmul to GGML_PREC_F32 to fix infinity values in output (#13639) b5431 2025-05-20 10:11:56 +02:00
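For context, this fix pins the affected matmul nodes to F32 accumulation via `ggml_mul_mat_set_prec`. A minimal sketch of the pattern (the tensor names are illustrative, not the actual GLM4 graph code):

```
#include "ggml.h"

// Project the attention output with F32 accumulation; with the default
// (possibly F16) matmul precision, some GLM4 activations overflow to +/-inf.
static struct ggml_tensor * attn_output_proj(
        struct ggml_context * ctx,
        struct ggml_tensor  * wo,     // blk.*.attn_output.weight
        struct ggml_tensor  * kqv) {  // merged attention heads
    struct ggml_tensor * cur = ggml_mul_mat(ctx, wo, kqv);
    ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    return cur;
}
```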
c00a2634be metal : fix typo in FA kernel comments (#13651) b5430 2025-05-20 10:41:40 +03:00
e298d2fbd0 kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA
* kv-cache : initial iSWA implementation
* kv-cache : rework error recovery logic
* models : fix Phi-3 SWA parameters
* model : adjust Granite to rope factor changes
* server : check if context can do shifts
* iswa : for now, always enable shifts (experiment)
* kv-cache : simplify SWA logic
* kv-cache : apply defrag when we fail to find slots for the batch
* llama : update docs about llama_decode
* kv-cache : update warning logs when no space for the batch is available
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
* llama : add param to control SWA cache size
* minor : clean-up
b5429
2025-05-20 08:05:46 +03:00
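The partial-SWA checks above rely on knowing the oldest position still resident in the cache, which is what the new `llama_kv_self_seq_pos_min()` exposes. A minimal sketch of such a check (helper name and threshold logic are illustrative):

```
#include "llama.h"

// With SWA, cells before the window may have been pruned. A caller that
// wants to resume decoding from a cached prefix must first check the
// minimum position still present for the sequence; a negative result is
// assumed here to mean "empty sequence".
static bool can_resume_from(llama_context * ctx, llama_seq_id seq, llama_pos pos) {
    const llama_pos pos_min = llama_kv_self_seq_pos_min(ctx, seq);
    return pos_min >= 0 && pos_min <= pos;
}
```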
f0adb80bf7 CANN: Update CANN model support (#13162)
* Update CANN model support status
* fix format of CANN.md
2025-05-20 11:43:43 +08:00
f7c9429c85 sycl : remove the mmap() allocation workaround on Windows (#13482)
* Remove the mmap workaround on Windows

After some testing I found that mmap is supported on Windows and for
many GPUs on Linux, so the workaround is no longer necessary.

* Update llama-bench README

With the workaround removed, the SYCL backend can run llama-bench
without specifying the `--mmp 0` flag.
b5427
2025-05-20 08:54:43 +08:00
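For reference, mmap is also controllable per load through the public API, which is the behavior the `--mmp 0` flag selects in llama-bench. A minimal sketch:

```
#include "llama.h"

// Load a model with mmap-based weight loading disabled.
static llama_model * load_without_mmap(const char * path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_mmap = false;
    return llama_model_load_from_file(path, mparams);
}
```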
1dfbf2cf3a common : add load_progress_callback (#13617) b5426 2025-05-19 21:17:36 +02:00
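The common helper added here wraps the core progress-callback mechanism. A minimal sketch against the underlying C API (the callback body is illustrative):

```
#include "llama.h"
#include <cstdio>

// Called periodically during model load; returning false cancels the load.
static bool on_load_progress(float progress, void * /*user_data*/) {
    fprintf(stderr, "\rloading: %3.0f%%", progress * 100.0f);
    return true;
}

static llama_model * load_with_progress(const char * path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.progress_callback = on_load_progress;
    return llama_model_load_from_file(path, mparams);
}
```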
8960efd0a6 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607) b5425 2025-05-19 17:54:08 +02:00
725f23f1f3 sycl : backend documentation review (#13544)
* sycl: reviewing and updating docs

* Updates Runtime error codes

* Improves OOM troubleshooting entry

* Added a llama 3 sample

* Updated supported models

* Updated releases table
2025-05-19 14:38:20 +01:00
92ecdcc06a mtmd : add vision support for llama 4 (#13282)
* wip llama 4 conversion
* rm redundant __init__
* fix conversion
* test impl
* try this
* reshape patch_embeddings_0
* fix view
* rm ffn_post_norm
* cgraph ok
* f32 for pos embd
* add image marker tokens
* Llama4UnfoldConvolution
* correct pixel shuffle
* fix merge conflicts
* correct
* add debug_graph
* logits matched, but it still perceives the image incorrectly
* fix style
* add image_grid_pinpoints
* handle llama 4 preprocessing
* rm load_image_size
* rm unused line
* fix
* small fix 2
* add test & docs
* fix llava-1.6 test
* test: add notion of huge models
* add comment
* add warn about degraded quality
b5423
2025-05-19 13:04:14 +02:00
f71f40a284 ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532) b5422 2025-05-19 11:46:09 +01:00
d30cb5a7fa sync : ggml b5421 2025-05-19 13:29:56 +03:00
6c35981a64 mnist: fix segmentation fault (ggml/1227) 2025-05-19 13:29:56 +03:00
8b5e19aea6 ggml : fix apple OS check in ggml_print_backtrace (ggml/1229) 2025-05-19 13:29:56 +03:00
60aea028b5 ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 13:29:56 +03:00
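The Yama setting mentioned above can be probed before attempting to attach a debugger. A minimal sketch of such a check (not the actual ggml code):

```
#include <cstdio>

// Returns the Yama ptrace scope: 0 = unrestricted, 1 = parent-only (the
// modern Linux default), higher values are stricter. A missing file means
// Yama is not in use, so treat it as 0.
static int yama_ptrace_scope() {
    int scope = 0;
    if (FILE * f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r")) {
        if (fscanf(f, "%d", &scope) != 1) {
            scope = 0;
        }
        fclose(f);
    }
    return scope;
}
```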
9c55e5c5c2 fix: check model pointer validity before use (#13631) b5417 2025-05-19 13:25:41 +03:00
33d7aed4a8 CANN: Support MOE Model MUL_MAT_ID (#13042)
Signed-off-by: noemotiovon <757486878@qq.com>
b5416
2025-05-19 14:21:17 +08:00
6a2bc8bfb7 server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md
b5415
2025-05-17 23:59:48 +02:00
e3a7cf6c5b cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
b5414
2025-05-17 15:26:43 -03:00
518329b2d4 parallel : add option for non-shared and larger prompts (#13598)
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
2025-05-17 12:58:55 +03:00
2f5a4e1e09 vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
b5412
2025-05-17 09:14:55 +02:00
4f41ee11d6 vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) b5411 2025-05-17 08:35:47 +02:00
3e0be1cace llguidance : official v0.7.20 release (no actual changes) [no ci] (#13594) b5410 2025-05-16 22:56:28 +02:00
6aa892ec2a server : do not return error out of context (with ctx shift disabled) (#13577) b5409 2025-05-16 21:50:00 +02:00
aea9f8b4e7 webui : improve accessibility for visually impaired people (#13551)
* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content
2025-05-16 21:49:01 +02:00
06c1e4abc1 readme : add list of dependencies and their license (#13591) 2025-05-16 20:04:18 +02:00
415e40a357 releases : use arm version of curl for arm releases (#13592) b5406 2025-05-16 19:36:51 +02:00
654a67794f metal : add FA-vec kernel for head size 64 (#13583) b5405 2025-05-16 20:32:58 +03:00
5364ae4ba5 llama : print hint when loading a model when no backends are loaded (#13589) b5404 2025-05-16 16:38:07 +02:00
7c07ac244d ci : add ppc64el to build-linux-cross (#13575) 2025-05-16 14:54:23 +02:00
0a338ed013 sycl : fixed compilation warnings (#13582) b5402 2025-05-16 18:15:29 +08:00
bc098c3cf0 minja: sync (qwen3) (#13573)
* minja: sync f06140fa52

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <ochafik@google.com>
b5401
2025-05-15 23:29:10 +01:00
c6a2c9e741 gguf : use ggml log system (#13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
b5400
2025-05-15 19:13:11 +02:00
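With gguf routed through the ggml log system, a single callback now receives gguf messages as well. A minimal sketch (the filtering policy is illustrative):

```
#include "ggml.h"
#include <cstdio>

// Forward warnings and errors to stderr, drop everything else.
static void my_log_cb(enum ggml_log_level level, const char * text, void * /*ud*/) {
    if (level == GGML_LOG_LEVEL_WARN || level == GGML_LOG_LEVEL_ERROR) {
        fputs(text, stderr);
    }
}

static void install_logging() {
    ggml_log_set(my_log_cb, nullptr);
}
```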
07ad2b6db3 gguf-py : fix disconnect-before-connect in editor-gui (#13569)
The bug caused a crash on load with venvs created with
--system-site-packages that use python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
2025-05-15 18:47:10 +02:00
c531edfa34 convert : fix conversion for llama 4 (#13567) 2025-05-15 17:40:07 +02:00
02cdd2d8b0 sycl: simplify bin_bcast_kernel (#13383) 2025-05-15 17:39:52 +02:00
64bb51cf90 sycl: reordered Q4_K MMVQ (#13109) 2025-05-15 17:35:44 +02:00
9c404ed54c sycl: use oneDNN for matrices multiplication (#12972) b5395 2025-05-15 16:53:41 +02:00
6c8b91500e llama-bench : fix -ot with dl backends (#13563) b5394 2025-05-15 15:46:55 +02:00
3cc1f1f1d2 webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages
2025-05-15 14:24:50 +02:00
c753d7bed0 server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540) b5392 2025-05-15 08:40:58 +02:00
b2838049cc bench : handle decode errors (#13548) b5391 2025-05-15 05:57:02 +03:00
aa48e373f2 server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b5390
2025-05-15 02:39:51 +01:00
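Generating the injected `date_string` amounts to formatting today's date in the style the llama 3.x templates expect. A minimal sketch (the exact format string per template is an assumption here):

```
#include <ctime>
#include <string>

// Produce a "26 Jul 2024"-style date for substitution into the chat template.
static std::string today_for_template() {
    char buf[32];
    std::time_t t = std::time(nullptr);
    std::strftime(buf, sizeof(buf), "%d %b %Y", std::localtime(&t));
    return buf;
}
```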
e3a9421b78 kv-cache : fix out-of-bounds view during reserve graph (#13547)
* kv-cache : fix reserve graph out-of-bounds access
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
5ab5d5fb25 arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves the q6_k_q8_k GEMM kernel using the arm64 i8mm instructions.

Tested on neoverse-n2 with a llama3 8b q6_k quantized model:
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
b5388
2025-05-14 21:53:52 +02:00
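The uplift comes from the SMMLA matrix-multiply instruction, exposed as the `vmmlaq_s32` intrinsic. A minimal sketch of the primitive the kernel builds on (the real q6_k_q8_k kernel additionally handles block scales and dequantization; compile with i8mm enabled, e.g. `-march=armv8.2-a+i8mm`):

```
#include <arm_neon.h>

// acc (2x2 int32 tile) += A (2x8 int8, row-major) * B (2x8 int8, row-major)^T
// Each of the four outputs accumulates an 8-element int8 dot product.
static int32x4_t tile_mma(int32x4_t acc, int8x16_t a_rows, int8x16_t b_rows) {
    return vmmlaq_s32(acc, a_rows, b_rows); // SMMLA
}
```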
3198405e98 common: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
* Update common/regex-partial.cpp
* Update common/regex-partial.h
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5387
2025-05-14 19:50:57 +01:00
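The `string_find_partial_stop` helper mentioned above addresses streaming output: hold back characters that might begin a stop word. A minimal sketch of that idea (a hand-rolled helper, not the exact common API):

```
#include <algorithm>
#include <cstddef>
#include <string_view>

// Return the offset where a (possibly partial) occurrence of `stop` begins
// at the end of `text`, or npos if the text cannot end in a stop prefix.
static size_t find_partial_stop(std::string_view text, std::string_view stop) {
    for (size_t len = std::min(stop.size(), text.size()); len > 0; --len) {
        if (text.substr(text.size() - len) == stop.substr(0, len)) {
            return text.size() - len;
        }
    }
    return std::string_view::npos;
}
```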
f5170c1d7a editorconfig : fix trailing whitespace from #13542 (#13546) 2025-05-14 21:22:49 +03:00
017f10b5fa fix: crash when calling llama_state_get_size on a context without a KV cache (#13542) b5385 2025-05-14 19:18:18 +03:00
4696d56749 CUDA: fix crash on large batch size for quant. MoE (#13537) b5384 2025-05-14 16:41:02 +02:00