60aea028b5
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
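A rough sketch of the ingredients mentioned above, not the ggml patch itself: with ptrace_scope set to 1 a forked child cannot attach a debugger to its parent, and glibc's backtrace facilities are one fallback that needs no ptrace at all (helper names below are illustrative).
```
// Illustrative only (not the ggml code): check the Yama ptrace restriction and
// fall back to printing symbols in-process via glibc's execinfo API.
#include <execinfo.h>   // backtrace, backtrace_symbols_fd (glibc)
#include <unistd.h>
#include <cstdio>

static int read_ptrace_scope(void) {
    int scope = 0; // 0 = classic permissive behaviour
    FILE * f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");
    if (f) {
        if (fscanf(f, "%d", &scope) != 1) {
            scope = 0;
        }
        fclose(f);
    }
    return scope;
}

static void print_backtrace_symbols_fallback(void) {
    void * trace[64];
    const int n = backtrace(trace, 64);
    backtrace_symbols_fd(trace, n, STDERR_FILENO);
}
```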
2025-05-19 13:29:56 +03:00
9c55e5c5c2
fix: check model pointer validity before use ( #13631 )
b5417
2025-05-19 13:25:41 +03:00
33d7aed4a8
CANN: Support MOE Model MUL_MAT_ID ( #13042 )
...
Signed-off-by: noemotiovon <757486878@qq.com>
b5416
2025-05-19 14:21:17 +08:00
6a2bc8bfb7
server : added --no-prefill-assistant flag ( #13608 )
...
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
b5415
2025-05-17 23:59:48 +02:00
e3a7cf6c5b
cmake: use the current build config for vulkan-shaders-gen ( #13595 )
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
b5414
2025-05-17 15:26:43 -03:00
518329b2d4
parallel : add option for non-shared and larger prompts ( #13598 )
...
* parallel : add option for non-shared and larger prompts
* parallel : update readme [no ci]
* cont : add note about base models [no ci]
* parallel : better var name
ggml-ci
2025-05-17 12:58:55 +03:00
2f5a4e1e09
vulkan: move common FA code to flash_attn_base.comp ( #13556 )
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
b5412
2025-05-17 09:14:55 +02:00
4f41ee11d6
vulkan: use scalar FA rather than coopmat2 when N==1 ( #13554 )
b5411
2025-05-17 08:35:47 +02:00
3e0be1cace
llguidance : official v0.7.20 release (no actual changes) [noci] ( #13594 )
b5410
2025-05-16 22:56:28 +02:00
6aa892ec2a
server : do not return error out of context (with ctx shift disabled) ( #13577 )
b5409
2025-05-16 21:50:00 +02:00
aea9f8b4e7
webui : improve accessibility for visually impaired people ( #13551 )
...
* webui : improve accessibility for visually impaired people
* add a11y for extra contents
* fix some labels being read twice
* add skip to main content
2025-05-16 21:49:01 +02:00
06c1e4abc1
readme : add list of dependencies and their license ( #13591 )
2025-05-16 20:04:18 +02:00
415e40a357
releases : use arm version of curl for arm releases ( #13592 )
b5406
2025-05-16 19:36:51 +02:00
654a67794f
metal : add FA-vec kernel for head size 64 ( #13583 )
...
ggml-ci
b5405
2025-05-16 20:32:58 +03:00
5364ae4ba5
llama : print hint when loading a model when no backends are loaded ( #13589 )
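For context, a sketch of the situation the new hint targets (assumes a build with dynamically loaded backends; this is not the code from the commit). ggml_backend_load_all and ggml_backend_reg_count are public ggml API.
```
// Sketch only: with dynamically loaded backends the registry starts empty,
// which is the case the new hint warns about.
#include "ggml-backend.h"
#include <cstdio>

int main() {
    ggml_backend_load_all(); // discover and register backend shared libraries
    if (ggml_backend_reg_count() == 0) {
        fprintf(stderr, "warning: no ggml backends are loaded\n");
    }
    return 0;
}
```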
b5404
2025-05-16 16:38:07 +02:00
7c07ac244d
ci : add ppc64el to build-linux-cross ( #13575 )
2025-05-16 14:54:23 +02:00
0a338ed013
sycl : fixed compilation warnings ( #13582 )
b5402
2025-05-16 18:15:29 +08:00
bc098c3cf0
minja: sync (qwen3) ( #13573 )
...
* minja: sync f06140fa52
- https://github.com/google/minja/pull/67 (@grf53)
- https://github.com/google/minja/pull/66 (@taha-yassine)
- https://github.com/google/minja/pull/63 (@grf53)
- https://github.com/google/minja/pull/58
---------
Co-authored-by: ochafik <ochafik@google.com>

b5401
2025-05-15 23:29:10 +01:00
c6a2c9e741
gguf : use ggml log system ( #13571 )
...
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
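A minimal sketch of what routing output through the ggml log system looks like from the caller's side; ggml_log_set and the callback signature are public ggml API, the callback body is illustrative.
```
// Sketch: install a ggml log callback so gguf/ggml messages go through one sink.
#include "ggml.h"
#include <cstdio>

static void log_cb(enum ggml_log_level level, const char * text, void * /*user_data*/) {
    fputs(text, level == GGML_LOG_LEVEL_ERROR ? stderr : stdout);
}

int main() {
    ggml_log_set(log_cb, nullptr);
    return 0;
}
```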
b5400
2025-05-15 19:13:11 +02:00
07ad2b6db3
gguf-py : fix disconnect-before-connect in editor-gui ( #13569 )
...
The bug caused a crash on load with venvs created with
--system-site-packages that use python3-pyside6.qtwidgets=6.6.2-4
from Kubuntu 24.10.
2025-05-15 18:47:10 +02:00
c531edfa34
convert : fix conversion for llama 4 ( #13567 )
2025-05-15 17:40:07 +02:00
02cdd2d8b0
sycl: simplify bin_bcast_kernel ( #13383 )
2025-05-15 17:39:52 +02:00
64bb51cf90
sycl: reordered Q4_K MMVQ ( #13109 )
2025-05-15 17:35:44 +02:00
9c404ed54c
sycl: use oneDNN for matrices multiplication ( #12972 )
b5395
2025-05-15 16:53:41 +02:00
6c8b91500e
llama-bench : fix -ot with dl backends ( #13563 )
b5394
2025-05-15 15:46:55 +02:00
3cc1f1f1d2
webui : handle PDF input (as text or image) + convert pasted long content to file ( #13562 )
...
* webui : handle PDF input (as text or image)
* handle the case of PDF-as-image when the server runs without mtmd
* fix bug missing pages
2025-05-15 14:24:50 +02:00
c753d7bed0
server : proper error handling for missing elements in messages array (OpenAI compatible backend) ( #13540 )
b5392
2025-05-15 08:40:58 +02:00
b2838049cc
bench : handle decode errors ( #13548 )
...
ggml-ci
b5391
2025-05-15 05:57:02 +03:00
aa48e373f2
server : inject date_string in llama 3.x template + fix date for firefunction v2 ( #12802 )
...
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
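A sketch of how such a date string can be produced (strftime-based, assuming the "%d %b %Y" layout that llama 3.x style templates print as "Today Date"; the server's exact formatting may differ).
```
// Illustrative helper: format today's date for injection into a chat template.
#include <ctime>
#include <string>

static std::string today_date_string() {
    char buf[32];
    std::time_t now = std::time(nullptr);
    std::strftime(buf, sizeof(buf), "%d %b %Y", std::localtime(&now)); // e.g. "15 May 2025"
    return buf;
}
```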
b5390
2025-05-15 02:39:51 +01:00
e3a9421b78
kv-cache : fix out-of-bounds view during reserve graph ( #13547 )
...
* kv-cache : fix reserve graph out-of-bounds access
ggml-ci
* cont : add comment
* cont : fix comments [no ci]
* cont : more correct comment [no ci]
2025-05-14 23:15:15 +03:00
5ab5d5fb25
arm64: optimize q6_k_q8_k kernel with i8mm ( #13519 )
...
This PR improves the q6_k_q8_k GEMM kernel using the arm64 i8mm instruction.
Tested on Neoverse-N2 with a Llama 3 8B Q6_K model:
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity does not change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
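For readers unfamiliar with i8mm, a minimal standalone illustration of the primitive the kernel builds on (not the kernel from this PR): vmmlaq_s32 multiplies two 2x8 int8 tiles as A * Bᵀ and accumulates into a 2x2 int32 tile.
```
// Sketch only: the SMMLA/i8mm primitive used by the optimized kernel.
// acc (2x2 int32, row-major in one int32x4_t) += a (2x8 int8) * transpose(b (2x8 int8))
#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>

static inline int32x4_t mmla_2x2(int32x4_t acc, int8x16_t a, int8x16_t b) {
    return vmmlaq_s32(acc, a, b);
}
#endif
```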
b5388
2025-05-14 21:53:52 +02:00
3198405e98
common : add partial regex support ( #12808 )
...
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
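The rough idea behind partial matching, as a sketch rather than the common_regex implementation: while streaming, a suffix of the generated text may be the prefix of a stop string (or of a regex match), so it has to be held back instead of being either emitted or treated as a full match.
```
// Illustrative only: report where a (possibly incomplete) occurrence of `stop`
// begins at the end of `text`, or npos if the text cannot be extended into one.
#include <algorithm>
#include <cstddef>
#include <string_view>

static std::size_t find_partial_stop(std::string_view text, std::string_view stop) {
    for (std::size_t len = std::min(text.size(), stop.size()); len > 0; --len) {
        if (text.substr(text.size() - len) == stop.substr(0, len)) {
            return text.size() - len;
        }
    }
    return std::string_view::npos;
}
```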
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5387
2025-05-14 19:50:57 +01:00
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 ( #13546 )
2025-05-14 21:22:49 +03:00
017f10b5fa
fix: crash when calling llama_state_get_size on a context without a KV cache ( #13542 )
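For reference, llama_state_get_size is part of the public llama.h API; a minimal usage sketch (assumes an already-created context), which with this fix is safe even when the context has no KV cache.
```
// Illustrative: query how many bytes a serialized context state would need.
#include "llama.h"
#include <cstdio>

static void report_state_size(llama_context * ctx) {
    const size_t n = llama_state_get_size(ctx);
    printf("state buffer size: %zu bytes\n", n);
}
```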
b5385
2025-05-14 19:18:18 +03:00
4696d56749
CUDA: fix crash on large batch size for quant. MoE ( #13537 )
b5384
2025-05-14 16:41:02 +02:00
b7d2672082
llama : fix quantize with dl backends ( #13539 )
2025-05-14 16:12:36 +02:00
6da34fa276
CUDA: faster Deepseek FA, add Turing support ( #13435 )
b5382
2025-05-14 16:08:20 +02:00
5e7d95e22e
fix: Move build_inp_pos to the top of the graph section for build_granite ( #13538 )
...
This matches how others do it, but will still avoid the extra
initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b5381
2025-05-14 15:53:59 +03:00
053174436f
server : passthrough the /models endpoint during loading ( #13535 )
...
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
b5380
2025-05-14 15:42:10 +03:00
360a9c98e1
server : fix cache_tokens bug with no cache_prompt ( #13533 )
b5379
2025-05-14 13:35:07 +02:00
09d13d94fb
cmake: simplify vulkan shader test logic ( #13263 )
b5378
2025-05-14 07:53:57 -03:00
24e86cae72
vulkan: KHR_coopmat flash attention ( #13506 )
...
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
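For orientation, a scalar reference of the two products the message refers to (textbook single-head attention, not the shader): S = Q·Kᵀ is what the new shader runs through KHR_cooperative_matrix, while P·V, after the softmax, is still done without it.
```
// Reference only: naive attention for one head, to label the Q*K^T and P*V steps.
#include <algorithm>
#include <cmath>
#include <vector>

// q: n_q x d, k/v: n_kv x d, out: n_q x d (all row-major)
static void attention_ref(const std::vector<std::vector<float>> & q,
                          const std::vector<std::vector<float>> & k,
                          const std::vector<std::vector<float>> & v,
                          std::vector<std::vector<float>> & out) {
    const size_t d = q[0].size();
    const float scale = 1.0f / std::sqrt((float) d);
    out.assign(q.size(), std::vector<float>(d, 0.0f));
    for (size_t i = 0; i < q.size(); ++i) {
        std::vector<float> s(k.size());
        float smax = -INFINITY;
        for (size_t j = 0; j < k.size(); ++j) {              // S = Q*K^T (coopmat part)
            float dot = 0.0f;
            for (size_t c = 0; c < d; ++c) dot += q[i][c] * k[j][c];
            s[j] = dot * scale;
            smax = std::max(smax, s[j]);
        }
        float sum = 0.0f;
        for (float & x : s) { x = std::exp(x - smax); sum += x; }  // P = softmax(S)
        for (size_t j = 0; j < k.size(); ++j) {              // O = P*V (still scalar here)
            for (size_t c = 0; c < d; ++c) out[i][c] += (s[j] / sum) * v[j][c];
        }
    }
}
```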
b5377
2025-05-14 11:55:26 +02:00
bb1681fbd5
webui : use fflate for more deterministic gzip compress ( #13525 )
...
* webui : use pako for more deterministic gzip compress
* simpler code
* use fflate instead of pako
2025-05-14 10:26:12 +02:00
d486dd3e8e
webui: Allow pasting file from clipboard ( #13526 )
...
* server: Allow pasting file from clipboard
* server: Prevent default action on file paste
* update build
* format then build combined
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-14 10:07:31 +02:00
21ca987fba
docs: Update link to ggml-org in multimodal.md ( #13513 )
...
* Update multimodal.md
Minor change to include the huggingface link
* Update docs/multimodal.md
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-14 09:59:12 +02:00
be1d4a13db
scripts : fix compare-llama-bench.py show parameter ( #13514 )
2025-05-14 08:41:01 +02:00
ab3971f2a0
vulkan: workaround FA compile failures on macos ( #13517 )
b5372
2025-05-14 06:15:50 +02:00
e5c834f718
quantize : improve tensor-type pattern matching ( #13033 )
b5371
2025-05-13 19:12:31 +02:00
71bdbdb587
clip : clip.h become private API ( ⚠️ breaking change) ( #13510 )
b5370
2025-05-13 17:07:21 +02:00
f0995d28ce
metal : use FA-vec kernel up to batch size 20 ( #13496 )
...
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
b5369
2025-05-13 18:04:39 +03:00