Commit Graph

5752 Commits

Author SHA1 Message Date
745aa5319b llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
b5602
2025-06-06 14:11:15 +03:00
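For context on what the replacement API looks like from user code, here is a minimal sketch of clearing a context's memory through the newer calls the commit moves towards; `llama_get_memory` and the exact `llama_memory_clear` signature are assumptions inferred from the commit titles, not a verbatim excerpt of llama.h.

```cpp
// Hedged sketch: forwarding from the deprecated kv_self-style call to the
// newer memory API. Names other than llama_memory_clear are assumptions
// based on the commit titles, not a copy of llama.h.
#include "llama.h"

void clear_cache(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);  // may be nullptr for memory-less models
    if (mem != nullptr) {
        // second argument: also clear the data buffers (the optional-clear flag)
        llama_memory_clear(mem, /*data =*/ true);
    }
}
```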
487a5e0401 context : fix SWA-related warning for multiple sequences (#14045) b5601 2025-06-06 13:29:18 +03:00
d17a809ef0 llama : support multiple classifier outputs and labels (#13940) b5600 2025-06-06 09:03:25 +02:00
1caae7fc6c gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels

* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
669c13e0f6 vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)
* allowing B580 and U9-288V

* experimental code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
b5598
2025-06-05 16:00:29 +02:00
146b88e8b3 ci: fix CUDA build failure on autodl cloud machines (#14005)
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-06-05 16:25:29 +03:00
7f37b6cf1e memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
b5596
2025-06-05 15:29:22 +03:00
3a077146a4 llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) b5595 2025-06-05 11:57:42 +02:00
d01d112abb readme : add badge (#13938) 2025-06-05 10:50:55 +03:00
9f47fa5792 vocab : warn about missing mask token (#14022) b5593 2025-06-05 09:29:18 +02:00
9e31bec4fd context : fix pos_min initialization upon error decode (#14008)
ggml-ci
b5592
2025-06-05 09:06:29 +03:00
5a8ae3053c vulkan: automatically deduce size of push constants (#13936) b5591 2025-06-05 07:17:58 +02:00
0d3984424f ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests so they pass with llvmpipe.

* supports_op condition moved from unintended position
b5590
2025-06-04 22:02:00 +02:00
3e63a58ef7 kv-cache : refactor the update/defrag mechanism (#13988)
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
b5589
2025-06-04 18:58:20 +03:00
2589ad3704 ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997) b5588 2025-06-04 15:37:40 +02:00
482548716f releases : use dl backend for linux release, remove arm64 linux release (#13996) b5587 2025-06-04 13:15:54 +02:00
3ac67535c8 llama-graph : use ggml_repeat_4d (#13998) b5586 2025-06-04 10:11:26 +02:00
0b4be4c435 CUDA: fix FTZ in FA for Gemma 3 (#13991) b5585 2025-06-04 08:57:05 +02:00
e0e806f52e kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)
ggml-ci
b5584
2025-06-04 09:50:32 +03:00
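The KV-cache API documents a negative seq_id as matching any sequence; the sketch below illustrates that wildcard rule, with the cell type and fields as simplified stand-ins rather than the real unified cache implementation.

```cpp
// Illustrative sketch of the seq_id < 0 convention (negative id = match any
// sequence). Simplified types, not the actual unified KV cache.
#include <cstdint>
#include <set>
#include <vector>

struct kv_cell {
    int32_t           pos = -1;        // -1 = cell is free
    std::set<int32_t> seq_ids;
};

void seq_rm(std::vector<kv_cell> & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    for (auto & cell : cells) {
        if (cell.pos < p0 || cell.pos >= p1) {
            continue;                   // outside the [p0, p1) range
        }
        if (seq_id < 0) {
            cell.seq_ids.clear();       // wildcard: detach the cell from every sequence
        } else {
            cell.seq_ids.erase(seq_id); // detach only the requested sequence
        }
        if (cell.seq_ids.empty()) {
            cell.pos = -1;              // no sequence left: free the cell
        }
    }
}
```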
7e00e60ef8 vulkan: fix warnings in perf logger querypool code (#13937) 2025-06-03 20:30:22 +02:00
ea1431b0fa docs : add "Quick start" section for new users (#13862)
* docs : add "Quick start" section for non-technical users

* rm flox

* Update README.md
2025-06-03 13:09:36 +02:00
71e74a3ac9 opencl: add backend_synchronize (#13939)
* This is not needed in normal use, where the result is read
  using `tensor_get`, but it allows the perf mode of `test-backend-ops`
  to properly measure performance.
b5581
2025-06-02 16:54:58 -07:00
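A backend's synchronize hook typically just blocks until its command queue has drained; below is a hedged sketch of what that looks like with OpenCL, using a simplified context struct rather than the actual backend state.

```cpp
// Sketch of a synchronize hook for an OpenCL backend: block until all queued
// work has completed so perf mode in test-backend-ops measures real work.
// The context struct is a simplified stand-in for the real backend state.
#include <CL/cl.h>

struct opencl_backend_context_sketch {
    cl_command_queue queue;
};

static void backend_synchronize(opencl_backend_context_sketch * ctx) {
    // clFinish blocks until every command in the queue has finished executing
    clFinish(ctx->queue);
}
```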
bfb1e012a0 OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes
b5580
2025-06-02 16:53:36 -07:00
3637576288 server : disable speculative decoding for SWA models (#13970)
* server : use swa-full for draft context

ggml-ci

* server : disable speculative decoding for SWA models
b5579
2025-06-02 21:34:40 +03:00
ea394d7ab1 metal : use F32 accumulators in FA kernels (#13975)
ggml-ci
b5578
2025-06-02 21:33:40 +03:00
5582c49c39 gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B

* cont : apply scale before attn

* cont : consistent attention scaling
b5577
2025-06-02 20:54:26 +03:00
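As a rough illustration of "apply scale before attn": the query tensor is scaled once before the QK^T matmul rather than scaling the scores afterwards. This is a sketch built from public ggml ops, not the actual llama.cpp graph code.

```cpp
// Sketch only: scale Q before computing attention scores, instead of scaling
// the scores after the matmul. Mathematically equivalent, but applying the
// scale up front keeps the handling consistent across model versions.
#include "ggml.h"

struct ggml_tensor * scaled_attn_scores(struct ggml_context * ctx,
                                        struct ggml_tensor  * q,
                                        struct ggml_tensor  * k,
                                        float                 attn_scale) {
    q = ggml_scale(ctx, q, attn_scale);  // apply the scale to Q before attention
    return ggml_mul_mat(ctx, k, q);      // attention scores from the scaled Q
}
```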
c9bbc77931 server: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
b5576
2025-06-02 10:15:44 -07:00
bfd322796c mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5575
2025-06-02 16:29:28 +02:00
093e3f1feb cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".

This patch provides a fix by first converting the string to uppercase before applying the regex.

Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
b5574
2025-06-02 15:18:36 +03:00
663445b0de sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)
* [WIP]: fuse q8 quantization and reorder

* wip2: fuse q8 quantization and reorder

* working q8 reorder commit

* restored common.hpp

* remove debug prints

* remove unnecessary headers and remove trailing whitespace

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
b5573
2025-06-02 10:12:20 +01:00
7675c555a1 gguf: fix failure on version == 0 (#13956) b5572 2025-06-01 18:08:05 +02:00
5e1c3aed40 convert : fix nomic-bert-moe mask token (#13757) b5571 2025-06-01 18:07:21 +02:00
c496fe0b1d convert : fix vocab padding code for bert models (#13954) 2025-06-01 17:23:11 +02:00
e57bb87ced ggml: check if non-native endian model is being loaded (#13943)
* gguf: prevent non-native endian models from being loaded

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: update error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: make the non-native endian check more verbose

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_assert location

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: reword the endianness check error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5569
2025-06-01 16:53:57 +02:00
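One way a loader can flag a byte-swapped file, in the spirit of the check described above, is a sanity test on the GGUF version field: an assumption-laden sketch, not the exact ggml code.

```cpp
// Assumption-labelled sketch, not the exact ggml/gguf check: a GGUF version
// that is only plausible after a byte swap suggests the file was written on a
// machine with the opposite endianness, so refuse to load it with a clear
// message instead of failing later.
#include <cstdint>
#include <cstdio>

static uint32_t byteswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}

bool gguf_version_is_native_endian(uint32_t version) {
    // real GGUF versions are small positive integers; a byte-swapped small
    // value becomes a huge number, and vice versa
    if (version != 0 && version < 0xFFFF) {
        return true;
    }
    fprintf(stderr, "gguf: model may be non-native endian (version = %u, byteswapped = %u)\n",
            version, byteswap32(version));
    return false;
}
```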
f3a4b1659c sync : ggml
ggml-ci
b5568
2025-06-01 13:43:57 +03:00
108009f5c7 vulkan : Remove unexpected ; (ggml/1253) 2025-06-01 13:43:57 +03:00
d337252acf cmake : Fix broken CMake error messages (ggml/1252) 2025-06-01 13:43:57 +03:00
af6f91db47 ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation was already deleted in commit 9d0762e.

closes: #1235
2025-06-01 13:43:57 +03:00
a7b8d35f78 sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)

* sync : whisper.cpp

ggml-ci

---------

Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>
2025-06-01 13:43:57 +03:00
6eba72b71c ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends

Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
2025-06-01 13:43:57 +03:00
fedf034a98 ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.

The handler is registered in ggml_init. Also fixes a minor fd leak on Linux.
2025-06-01 13:43:57 +03:00
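The mechanism behind this commit is a std::terminate handler that dumps a backtrace before aborting; below is a POSIX-only sketch of that idea (the actual ggml code also covers Windows and wires the registration into ggml_init).

```cpp
// Sketch of the general mechanism: install a terminate handler so an uncaught
// C++ exception prints a backtrace before aborting. POSIX-only illustration,
// not the actual ggml implementation.
#include <cstdlib>
#include <exception>
#include <execinfo.h>
#include <unistd.h>

static void print_backtrace_and_abort() {
    void * frames[64];
    const int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  // write symbols straight to stderr
    std::abort();
}

static void install_terminate_handler() {
    std::set_terminate(print_backtrace_and_abort);
}
```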
8726392d3d readme : update bindings (#13950) 2025-06-01 11:44:30 +03:00
c04621711a parallel : fix n_junk == 0 (#13952) b5560 2025-06-01 11:42:16 +03:00
0fc16b42e8 kv-cache : split implementation in separate sources (#13920)
ggml-ci
b5559
2025-06-01 11:39:27 +03:00
053b1539c0 threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5558
2025-05-31 15:39:19 -07:00
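The Windows mechanism referenced above is per-thread power throttling (EcoQoS); the sketch below shows how a thread can opt out of it. The real llama.cpp code additionally checks the Windows version and ties this to the thread priority settings.

```cpp
// Sketch of disabling Windows power throttling (EcoQoS) for the current
// thread, the mechanism the commit describes for avoiding core parking.
// The actual code also guards this behind a Windows version check.
#if defined(_WIN32)
#include <windows.h>

static void disable_power_throttling_for_current_thread() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = 0;  // 0 = do not throttle execution speed for this thread

    SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                         &state, sizeof(state));
}
#endif
```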
b3a89c3d9e docs : Note about necessity of having libcurl installed for standard build. (#13945)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2025-05-31 18:58:35 +02:00
e15898d1c7 server: allow unclosed thinking tags (#13931) b5556 2025-05-31 08:26:10 -07:00
803f8baf4f llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
b5555
2025-05-31 15:58:33 +03:00
3600cc2886 llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache

ggml-ci

* llama : add warning about multi-sequence SWA contexts
b5554
2025-05-31 15:57:44 +03:00
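The sizing idea in the title can be read as: an SWA layer's cache only needs to cover the attention window plus one micro-batch of new tokens, never more than the full context. A toy sketch with illustrative names, not the actual llama.cpp fields:

```cpp
// Toy sketch of the SWA cache sizing described by the commit title.
// Names are illustrative; the real code applies additional padding.
#include <algorithm>
#include <cstdint>

uint32_t swa_cache_size(uint32_t n_ctx, uint32_t n_swa, uint32_t n_ubatch) {
    // window plus one micro-batch of new tokens, capped at the full context
    return std::min(n_ctx, n_swa + n_ubatch);
}
```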
c7e0a2054b webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.

* use Modal Provider to simplify the use of confirm and alert modals.

* Increase the z index of the modal dialogs.

* Update index.html.gz

* also add showPrompt

* rebuild

---------

Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00