745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
b5602
2025-06-06 14:11:15 +03:00
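A minimal migration sketch for the deprecation above, assuming a loaded `llama_context * ctx`; the `llama_get_memory` accessor and the `data` flag are taken to match this commit series, so treat the exact signatures as assumptions if you are on a different build:

```cpp
#include "llama.h"

// before (deprecated): llama_kv_self_clear(ctx);
// after: fetch the generic memory handle once, then operate on it
void reset_context_memory(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);

    // data = true also clears the buffer contents, not just the metadata
    // (the optional-clear flag added in this series)
    llama_memory_clear(mem, /*data=*/true);
}
```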
487a5e0401
context : fix SWA-related warning for multiple sequences (#14045)
b5601
2025-06-06 13:29:18 +03:00
d17a809ef0
llama : support multiple classifier outputs and labels (#13940)
b5600
2025-06-06 09:03:25 +02:00
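A hedged sketch of consuming the multiple classifier outputs this change enables; `llama_model_n_cls_out` and `llama_model_cls_label` are the accessor names this PR is understood to introduce (treat them, and the pooling assumptions, as illustrative):

```cpp
#include <cstdio>
#include "llama.h"

// print one score per classifier head, assuming a classification model
// evaluated so that llama_get_embeddings_seq() returns per-sequence outputs
void print_classifier_outputs(llama_context * ctx, const llama_model * model) {
    const uint32_t n_out = llama_model_n_cls_out(model);
    const float *  out   = llama_get_embeddings_seq(ctx, /*seq_id=*/0);

    for (uint32_t i = 0; i < n_out; ++i) {
        const char * label = llama_model_cls_label(model, i); // may be NULL
        printf("%s: %.4f\n", label ? label : "(unnamed)", out[i]);
    }
}
```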
1caae7fc6c
gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels
* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
669c13e0f6
vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)
* allowing B580 and U9-288V
* experimental code to detect Xe2
* allowing coopmat only for Xe2 GPUs
* fixed comment wording
* fixed comment wording
* removed unnecessary driver check
b5598
2025-06-05 16:00:29 +02:00
146b88e8b3
ci: fix CUDA build failure on autodl cloud machines (#14005)
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.
Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-06-05 16:25:29 +03:00
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
b5596
2025-06-05 15:29:22 +03:00
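The same generic handle covers per-sequence operations; a sketch assuming the `llama_memory_seq_*` family mirrors the old `llama_kv_self_seq_*` calls (by convention a negative position means "to the end", and a negative seq_id matches any sequence, which the unified::seq_rm fix later in this log relies on):

```cpp
#include "llama.h"

// drop the tail of sequence 0 past position n_keep, then fork the kept
// prefix into sequence 1 - a typical restart-from-prefix pattern
void fork_from_prefix(llama_context * ctx, llama_pos n_keep) {
    llama_memory_t mem = llama_get_memory(ctx);

    llama_memory_seq_rm(mem, /*seq_id=*/0, /*p0=*/n_keep, /*p1=*/-1); // -1 = to the end
    llama_memory_seq_cp(mem, /*src=*/0, /*dst=*/1, /*p0=*/0, /*p1=*/n_keep);
}
```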
3a077146a4
llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013)
b5595
2025-06-05 11:57:42 +02:00
d01d112abb
readme : add badge (#13938)
2025-06-05 10:50:55 +03:00
9f47fa5792
vocab : warn about missing mask token (#14022)
b5593
2025-06-05 09:29:18 +02:00
9e31bec4fd
context : fix pos_min initialization on decode error (#14008)
ggml-ci
b5592
2025-06-05 09:06:29 +03:00
5a8ae3053c
vulkan: automatically deduce size of push constants (#13936)
b5591
2025-06-05 07:17:58 +02:00
0d3984424f
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader.
Number of additional tests reduced to 108.
* Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
b5590
2025-06-04 22:02:00 +02:00
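For context, `ggml_conv_transpose_1d` is the graph-level op the new Vulkan kernel implements; a small host-side sketch of building it (shapes illustrative; kernel laid out as [K, C_out, C_in], input as [L, C_in], and ggml currently expects zero padding and unit dilation here):

```cpp
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = { /*mem_size=*/ 16*1024*1024, /*mem_buffer=*/ NULL, /*no_alloc=*/ false };
    struct ggml_context * ctx = ggml_init(params);

    // kernel [K=3, C_out=16, C_in=8], input [L=64, C_in=8]
    struct ggml_tensor * kernel = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 3, 16, 8);
    struct ggml_tensor * input  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);

    // stride 2, padding 0, dilation 1
    struct ggml_tensor * out = ggml_conv_transpose_1d(ctx, kernel, input, /*s0=*/2, /*p0=*/0, /*d0=*/1);
    (void) out; // ... fill tensors and compute the graph on the desired backend ...

    ggml_free(ctx);
    return 0;
}
```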
3e63a58ef7
kv-cache : refactor the update/defrag mechanism (#13988)
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
b5589
2025-06-04 18:58:20 +03:00
2589ad3704
ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997)
b5588
2025-06-04 15:37:40 +02:00
482548716f
releases : use dl backend for linux release, remove arm64 linux release (#13996)
b5587
2025-06-04 13:15:54 +02:00
3ac67535c8
llama-graph : use ggml_repeat_4d (#13998)
b5586
2025-06-04 10:11:26 +02:00
0b4be4c435
CUDA: fix FTZ in FA for Gemma 3 (#13991)
b5585
2025-06-04 08:57:05 +02:00
e0e806f52e
kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)
ggml-ci
b5584
2025-06-04 09:50:32 +03:00
7e00e60ef8
vulkan: fix warnings in perf logger querypool code (#13937)
2025-06-03 20:30:22 +02:00
ea1431b0fa
docs : add "Quick start" section for new users ( #13862 )
...
* docs : add "Quick start" section for non-technical users
* rm flox
* Update README.md
2025-06-03 13:09:36 +02:00
71e74a3ac9
opencl: add backend_synchronize (#13939)
* This is not needed for normal use, where the result is read
using `tensor_get`, but it allows the perf mode of `test-backend-ops`
to properly measure performance.
b5581
2025-06-02 16:54:58 -07:00
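For reference, `ggml_backend_synchronize` is the generic backend call being implemented here; a sketch of the timing pattern the perf mode depends on (on an asynchronous backend, stopping the clock before synchronizing would measure only submission, not execution):

```cpp
#include <chrono>
#include <cstdio>
#include "ggml-backend.h"

// time a graph compute on a possibly-asynchronous backend
void time_compute(ggml_backend_t backend, struct ggml_cgraph * graph) {
    const auto t0 = std::chrono::steady_clock::now();

    ggml_backend_graph_compute_async(backend, graph);
    ggml_backend_synchronize(backend); // wait for the device to finish

    const auto t1 = std::chrono::steady_clock::now();
    printf("compute: %.3f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
}
```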
bfb1e012a0
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
b5580
2025-06-02 16:53:36 -07:00
3637576288
server : disable speculative decoding for SWA models (#13970)
* server : use swa-full for draft context
ggml-ci
* server : disable speculative decoding for SWA models
b5579
2025-06-02 21:34:40 +03:00
ea394d7ab1
metal : use F32 accumulators in FA kernels (#13975)
ggml-ci
b5578
2025-06-02 21:33:40 +03:00
5582c49c39
gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
b5577
2025-06-02 20:54:26 +03:00
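Background for the scaling change, with figures that are general Gemma-2 facts rather than taken from the commit itself: Gemma scales attention by a model-specific query_pre_attn_scalar rather than by the head dimension, and for the 27B model the two differ, so using the head dimension silently misscales the logits:

```latex
\text{usual:}\quad \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{d_{\text{head}}}}\right)V
\qquad
\text{Gemma:}\quad \mathrm{softmax}\!\left(\tfrac{QK^{\top}}{\sqrt{s}}\right)V,
\quad s = \texttt{query\_pre\_attn\_scalar}
```

For Gemma-2 27B, d_model = 4608 and n_head = 32, so s = 4608/32 = 144 while d_head = 128; "apply scale before attn" folds 1/sqrt(s) into Q ahead of the matmul, which is numerically equivalent to scaling the logits.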
c9bbc77931
server : update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
b5576
2025-06-02 10:15:44 -07:00
bfd322796c
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5575
2025-06-02 16:29:28 +02:00
093e3f1feb
cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".
This patch provides a fix by first converting the string to uppercase before applying the regex.
Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
b5574
2025-06-02 15:18:36 +03:00
663445b0de
sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)
* [WIP]: fuse q8 quantization and reorder
* wip2: fuse q8 quantization and reorder
* working q8 reorder commit
* restored common.hpp
* remove debug prints
* remove unnecessary headers and remove trailing whitespace
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
---------
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
b5573
2025-06-02 10:12:20 +01:00
7675c555a1
gguf: fix failure on version == 0 (#13956)
b5572
2025-06-01 18:08:05 +02:00
5e1c3aed40
convert : fix nomic-bert-moe mask token (#13757)
b5571
2025-06-01 18:07:21 +02:00
c496fe0b1d
convert : fix vocab padding code for bert models (#13954)
2025-06-01 17:23:11 +02:00
e57bb87ced
ggml: check if non-native endian model is being loaded (#13943)
* gguf: prevent non-native endian models from being loaded
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* gguf: update error message
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* gguf: make the non-native endian check more verbose
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: move ggml_assert location
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: reword the endianness check error message
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5569
2025-06-01 16:53:57 +02:00
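A hedged illustration of the idea behind the check (not the repo's actual code): a byte-swapped GGUF file still begins with the right magic characters, but its version field decodes to an implausible value, which can be detected and reported before any tensor data is parsed:

```cpp
#include <cstdint>
#include <cstdio>

static uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00) | ((x << 8) & 0xff0000) | (x << 24);
}

// returns true if 'version' only makes sense after a byte swap,
// i.e. the model was written on a machine with the other endianness
static bool is_non_native_endian(uint32_t version) {
    const uint32_t swapped = bswap32(version);
    return version == 0 || (swapped != 0 && swapped < 0x10000 && version >= 0x10000);
}

int main() {
    // a big-endian file writes version 3 as 0x03000000 when read little-endian
    if (is_non_native_endian(0x03000000)) {
        fprintf(stderr, "error: model endianness does not match this host\n");
    }
    return 0;
}
```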
f3a4b1659c
sync : ggml
ggml-ci
b5568
2025-06-01 13:43:57 +03:00
108009f5c7
vulkan : Remove unexpected ; (ggml/1253)
2025-06-01 13:43:57 +03:00
d337252acf
cmake : Fix broken CMake error messages (ggml/1252)
2025-06-01 13:43:57 +03:00
af6f91db47
ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation is already deleted with commit 9d0762e.
closes: #1235
2025-06-01 13:43:57 +03:00
a7b8d35f78
sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)
* sync : whisper.cpp
ggml-ci
---------
Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>
2025-06-01 13:43:57 +03:00
6eba72b71c
ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends
Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
2025-06-01 13:43:57 +03:00
fedf034a98
ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.
This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-06-01 13:43:57 +03:00
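A minimal sketch of the mechanism (Linux-flavored, using glibc's execinfo; the real registration lives in ggml and is broader): install a std::terminate handler once at init so an uncaught exception prints a backtrace before aborting:

```cpp
#include <cstdlib>
#include <exception>
#include <execinfo.h> // glibc backtrace; Linux-specific

static void print_backtrace_on_terminate() {
    void * frames[64];
    const int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, /*fd=*/2); // write symbolized frames to stderr
    abort();
}

// call once at startup (the commit registers this during ggml_init)
void install_terminate_handler() {
    std::set_terminate(print_backtrace_on_terminate);
}
```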
8726392d3d
readme : update bindings (#13950)
2025-06-01 11:44:30 +03:00
c04621711a
parallel : fix n_junk == 0 (#13952)
b5560
2025-06-01 11:42:16 +03:00
0fc16b42e8
kv-cache : split implementation in separate sources (#13920)
ggml-ci
b5559
2025-06-01 11:39:27 +03:00
053b1539c0
threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (#12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling
We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.
Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggressively, which results in suboptimal performance with n_threads > 4.
To deal with that, we now disable Power Throttling for our threads for the NORMAL
and higher priorities.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* threading: disable SetThreadInfo() calls for older Windows versions
* Update tools/llama-bench/llama-bench.cpp
Co-authored-by: Diego Devesa <slarengh@gmail.com>
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5558
2025-05-31 15:39:19 -07:00
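A hedged sketch of the Windows side of this change, using the documented power-throttling API (the exact placement in ggml is an assumption; the call needs a recent Windows 10/11 SDK):

```cpp
#ifdef _WIN32
#include <windows.h>

// opt the current thread out of Power Throttling so the scheduler
// keeps it on performant cores instead of parking it
static void disable_power_throttling() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = 0; // control the flag but leave it cleared => throttling off

    SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling, &state, sizeof(state));
}
#endif
```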
b3a89c3d9e
docs : note that libcurl must be installed for the standard build (#13945)
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
2025-05-31 18:58:35 +02:00
e15898d1c7
server: allow unclosed thinking tags (#13931)
b5556
2025-05-31 08:26:10 -07:00
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls (#13921)
ggml-ci
b5555
2025-05-31 15:58:33 +03:00
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache (#13833)
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
b5554
2025-05-31 15:57:44 +03:00
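For scale (illustrative numbers, not from the commit): with n_swa = 1024 and n_ubatch = 512, the SWA cache needs only 1024 + 512 = 1536 cells regardless of n_ctx, whereas sizing by the full context would grow linearly with the context length.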
c7e0a2054b
webui : Replace alert and confirm with custom modals. (#13711)
* Replace alert and confirm with custom modals. This is needed as Webview in VS Code doesn't permit alert and confirm for security reasons.
* use Modal Provider to simplify the use of confirm and alert modals.
* Increase the z index of the modal dialogs.
* Update index.html.gz
* also add showPrompt
* rebuild
---------
Co-authored-by: igardev <ivailo.gardev@akros.ch>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-31 11:56:08 +02:00