Commit Graph

5618 Commits

Author SHA1 Message Date
60aea028b5 ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 13:29:56 +03:00
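For context on 60aea028b5: Yama's ptrace_scope=1 blocks a forked child from attaching to its parent unless the parent opts in via PR_SET_PTRACER. A minimal, hedged sketch of that opt-in (not ggml's exact code; gdb and the pipe handshake are illustrative choices):

```cpp
// Hedged sketch, not ggml's implementation: under ptrace_scope=1 the child
// may attach only after the parent grants permission with PR_SET_PTRACER.
// The pipe keeps the child from attaching before the grant lands.
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    if (pipe(fds) != 0) {
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        char c;
        read(fds[0], &c, 1);                     // wait for the ptrace grant
        char ppid[16];
        snprintf(ppid, sizeof(ppid), "%d", (int) getppid());
        execlp("gdb", "gdb", "--batch", "-ex", "bt", "-p", ppid, (char *) nullptr);
        _exit(127);                              // exec failed
    }
    prctl(PR_SET_PTRACER, child, 0, 0, 0);       // allow only this child to attach
    write(fds[1], "x", 1);
    waitpid(child, nullptr, 0);
    return 0;
}
```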
9c55e5c5c2 fix: check model pointer validity before use (#13631) b5417 2025-05-19 13:25:41 +03:00
33d7aed4a8 CANN: Support MOE Model MUL_MAT_ID (#13042)
Signed-off-by: noemotiovon <757486878@qq.com>
b5416
2025-05-19 14:21:17 +08:00
6a2bc8bfb7 server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag

* reworded documentation comment

* updated server README.md
b5415
2025-05-17 23:59:48 +02:00
e3a7cf6c5b cmake: use the current build config for vulkan-shaders-gen (#13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
b5414
2025-05-17 15:26:43 -03:00
518329b2d4 parallel : add option for non-shared and larger prompts (#13598)
* parallel : add option for non-shared and larger prompts

* parallel : update readme [no ci]

* cont : add note about base models [no ci]

* parallel : better var name

ggml-ci
2025-05-17 12:58:55 +03:00
2f5a4e1e09 vulkan: move common FA code to flash_attn_base.comp (#13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
b5412
2025-05-17 09:14:55 +02:00
4f41ee11d6 vulkan: use scalar FA rather than coopmat2 when N==1 (#13554) b5411 2025-05-17 08:35:47 +02:00
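The heuristic in 4f41ee11d6 is a dispatch decision: with a single query row (N == 1, i.e. token-by-token decode) there is no matrix tile to fill, so cooperative-matrix shaders add more overhead than they save. A hedged C++ sketch of that selection logic, with illustrative names rather than the Vulkan backend's actual ones:

```cpp
// Illustrative dispatch only - names are not the vulkan backend's.
enum class fa_shader { scalar, coopmat2 };

fa_shader select_fa_shader(int n_query_rows, bool coopmat2_supported) {
    // Single-row queries cannot fill a coopmat tile, so the scalar shader
    // wins; fall back to it as well when coopmat2 is unsupported.
    if (n_query_rows == 1 || !coopmat2_supported) {
        return fa_shader::scalar;
    }
    return fa_shader::coopmat2;
}
```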
3e0be1cace llguidance : official v0.7.20 release (no actual changes) [no ci] (#13594) b5410 2025-05-16 22:56:28 +02:00
6aa892ec2a server : do not return error out of context (with ctx shift disabled) (#13577) b5409 2025-05-16 21:50:00 +02:00
aea9f8b4e7 webui : improve accessibility for visually impaired people (#13551)
* webui : improve accessibility for visually impaired people

* add a11y for extra contents

* fix some labels being read twice

* add skip to main content
2025-05-16 21:49:01 +02:00
06c1e4abc1 readme : add list of dependencies and their license (#13591) 2025-05-16 20:04:18 +02:00
415e40a357 releases : use arm version of curl for arm releases (#13592) b5406 2025-05-16 19:36:51 +02:00
654a67794f metal : add FA-vec kernel for head size 64 (#13583)
ggml-ci
b5405
2025-05-16 20:32:58 +03:00
5364ae4ba5 llama : print hint when loading a model when no backends are loaded (#13589) b5404 2025-05-16 16:38:07 +02:00
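A hint like the one 5364ae4ba5 adds can be a simple guard on ggml's backend registry; a hedged sketch (the exact wording and call site in llama.cpp may differ):

```cpp
// Hedged sketch: an empty backend registry usually means the dynamically
// loaded backend libraries were not found on the search path.
#include "ggml-backend.h"
#include <cstdio>

void warn_if_no_backends() {
    if (ggml_backend_reg_count() == 0) {
        fprintf(stderr,
                "warning: no ggml backends are loaded - CPU/GPU backend "
                "libraries may be missing from the search path\n");
    }
}
```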
7c07ac244d ci : add ppc64el to build-linux-cross (#13575) 2025-05-16 14:54:23 +02:00
0a338ed013 sycl : fixed compilation warnings (#13582) b5402 2025-05-16 18:15:29 +08:00
bc098c3cf0 minja: sync (qwen3) (#13573)
* minja: sync f06140fa52

- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58

---------

Co-authored-by: ochafik <ochafik@google.com>
b5401
2025-05-15 23:29:10 +01:00
c6a2c9e741 gguf : use ggml log system (#13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
b5400
2025-05-15 19:13:11 +02:00
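For context on c6a2c9e741: ggml's log system is callback-based, with ggml_log_set() installing a handler that messages are routed through instead of raw fprintf. A hedged sketch of wiring it up; the filtering policy below is illustrative, not what gguf does:

```cpp
#include "ggml.h"
#include <cstdio>

// Custom sink matching the ggml_log_callback signature. A real filter
// would also track GGML_LOG_LEVEL_CONT continuation messages.
static void log_sink(enum ggml_log_level level, const char * text, void * /*user*/) {
    if (level >= GGML_LOG_LEVEL_WARN) {
        fputs(text, stderr);   // surface only warnings and errors
    }
}

int main() {
    ggml_log_set(log_sink, nullptr);
    // gguf/ggml messages now flow through log_sink
    return 0;
}
```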
07ad2b6db3 gguf-py : fix disconnect-before-connect in editor-gui (#13569)
The bug caused a crash on load in venvs created with
--system-site-packages that use
python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
2025-05-15 18:47:10 +02:00
c531edfa34 convert : fix conversion for llama 4 (#13567) 2025-05-15 17:40:07 +02:00
02cdd2d8b0 sycl: simplify bin_bcast_kernel (#13383) 2025-05-15 17:39:52 +02:00
64bb51cf90 sycl: reordered Q4_K MMVQ (#13109) 2025-05-15 17:35:44 +02:00
9c404ed54c sycl: use oneDNN for matrices multiplication (#12972) b5395 2025-05-15 16:53:41 +02:00
6c8b91500e llama-bench : fix -ot with dl backends (#13563) b5394 2025-05-15 15:46:55 +02:00
3cc1f1f1d2 webui : handle PDF input (as text or image) + convert pasted long content to file (#13562)
* webui : handle PDF input (as text or image)

* handle the case where pdf image + server without mtmd

* fix bug missing pages
2025-05-15 14:24:50 +02:00
c753d7bed0 server : proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540) b5392 2025-05-15 08:40:58 +02:00
b2838049cc bench : handle decode errors (#13548)
ggml-ci
b5391
2025-05-15 05:57:02 +03:00
aa48e373f2 server: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b5390
2025-05-15 02:39:51 +01:00
e3a9421b78 kv-cache : fix out-of-bounds view during reserve graph (#13547)
* kv-cache : fix reserve graph out-of-bounds access

ggml-ci

* cont : add comment

* cont : fix comments [no ci]

* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
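The class of bug fixed in e3a9421b78 is a view whose offset plus length exceeds its parent buffer. A hedged, generic sketch of the invariant being restored (the actual fix adjusts how the reserve graph sizes its KV views):

```cpp
#include <cstddef>

// A view of [offset, offset + len) into a parent buffer is valid only if
// it stays within parent_len; written to avoid size_t overflow.
bool view_in_bounds(std::size_t parent_len, std::size_t offset, std::size_t len) {
    return offset <= parent_len && len <= parent_len - offset;
}
```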
5ab5d5fb25 arm64: optimize q6_k_q8_k kernel with i8mm (#13519)
This PR improves the q6_k_q8_k GEMM kernel using the arm64 i8mm instructions.

Tested on neoverse-n2 with a llama3 8b q6_k-quantized model:
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch sizes 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
b5388
2025-05-14 21:53:52 +02:00
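The i8mm extension that 5ab5d5fb25 builds on provides smmla, exposed as the vmmlaq_s32 intrinsic: one instruction multiplies two 2x8 tiles of signed 8-bit integers (the second effectively transposed) and accumulates into a 2x2 int32 tile. A hedged sketch of the core building block, compiled with e.g. -march=armv8.6-a+i8mm; the real kernel adds quantization scale handling around this:

```cpp
#include <arm_neon.h>

// acc (2x2 int32) += A (2x8 int8) * B (2x8 int8)^T. Both operands are
// supplied as row-major 16-byte vectors; the instruction transposes B.
int32x4_t mm_tile_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b);
}
```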
3198405e98 common: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5387
2025-05-14 19:50:57 +01:00
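The point of partial matching in 3198405e98 is streaming: when output arrives in chunks, the tail of a chunk may be the beginning of a stop string (or of a regex trigger) and must be held back rather than emitted. A hedged sketch of the simpler string case, in the spirit of string_find_partial_stop but not its exact code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Return the index where a suffix of `text` equals a prefix of `stop`,
// or npos if there is no overlap. Text from that index onward must be
// withheld until the next chunk resolves the match.
std::size_t find_partial_stop(std::string_view text, std::string_view stop) {
    for (std::size_t n = std::min(text.size(), stop.size()); n > 0; --n) {
        if (text.substr(text.size() - n) == stop.substr(0, n)) {
            return text.size() - n;
        }
    }
    return std::string_view::npos;
}
```

For example, with text "Hello, wor" and stop "world" this returns 7, so the trailing "wor" is buffered instead of streamed to the client.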
f5170c1d7a editorconfig : fix trailing whitespace from #13542 (#13546) 2025-05-14 21:22:49 +03:00
017f10b5fa fix: crash when calling llama_state_get_size on a context without a KV cache (#13542) b5385 2025-05-14 19:18:18 +03:00
4696d56749 CUDA: fix crash on large batch size for quant. MoE (#13537) b5384 2025-05-14 16:41:02 +02:00
b7d2672082 llama : fix quantize with dl backends (#13539) 2025-05-14 16:12:36 +02:00
6da34fa276 CUDA: faster Deepseek FA, add Turing support (#13435) b5382 2025-05-14 16:08:20 +02:00
5e7d95e22e fix: Move build_inp_pos to the top of the graph section for build_granite (#13538)
This matches how other models do it, but still avoids the extra
initialization when rope is disabled.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b5381
2025-05-14 15:53:59 +03:00
053174436f server : passthrough the /models endpoint during loading (#13535)
* server : passthrough the /models endpoint during loading

* server : update readme + return json for "meta" field
b5380
2025-05-14 15:42:10 +03:00
360a9c98e1 server : fix cache_tokens bug with no cache_prompt (#13533) b5379 2025-05-14 13:35:07 +02:00
09d13d94fb cmake: simplify vulkan shader test logic (#13263) b5378 2025-05-14 07:53:57 -03:00
24e86cae72 vulkan: KHR_coopmat flash attention (#13506)
This shader uses coopmat1 for the Q*K^T multiply. The P*V multiply is more
difficult for various reasons, so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader during prompt
processing. Some of the benefit may come from other optimizations such as
staging through shared memory or splitting by rows.
b5377
2025-05-14 11:55:26 +02:00
bb1681fbd5 webui : use fflate for more deterministic gzip compress (#13525)
* webui : use pako for more deterministic gzip compress

* simpler code

* use fflate instead of pako
2025-05-14 10:26:12 +02:00
d486dd3e8e webui: Allow pasting file from clipboard (#13526)
* server: Allow pasting file from clipboard

* server: Prevent default action on file paste

* update build

* format then build combined

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00
21ca987fba docs: Update link to ggml-org in multimodal.md (#13513)
* Update multimodal.md

Minor change to include the huggingface link

* Update docs/multimodal.md

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-14 09:59:12 +02:00
be1d4a13db scripts : fix compare-llama-bench.py show parameter (#13514) 2025-05-14 08:41:01 +02:00
ab3971f2a0 vulkan: workaround FA compile failures on macos (#13517) b5372 2025-05-14 06:15:50 +02:00
e5c834f718 quantize : improve tensor-type pattern matching (#13033) b5371 2025-05-13 19:12:31 +02:00
71bdbdb587 clip : clip.h become private API (⚠️ breaking change) (#13510) b5370 2025-05-13 17:07:21 +02:00
f0995d28ce metal : use FA-vec kernel up to batch size 20 (#13496)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci
b5369
2025-05-13 18:04:39 +03:00