5758 Commits

91a8ee6a6f add geglu activation function (#14074)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5608
2025-06-09 05:15:31 +01:00
056eb74534 CANN: Enable labeler for Ascend NPU (#13914) 2025-06-09 11:20:06 +08:00
247e5c6e44 cuda : fix buffer type check with integrated GPUs (#14069) b5606 2025-06-08 11:39:56 -07:00
5787b5da57 ci: add LoongArch cross-compile build (#13944) 2025-06-07 10:39:11 -03:00
228f34c9ce SYCL: Implement a few same-quantized-type copy kernels (#13739)
* SYCL: Implement a few same-quantized-type copy kernels

* Use memcpy for copying contiguous tensors

ggml-ci

* feat(sycl): add contiguous tensor copy support and device checks

Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.

* refactor: replace specific block copy functions with template

The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.

* Exclude BF16 support for COPY tensors for now
ggml-ci

* perf: adjust SYCL copy kernel block sizes for efficiency

Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
b5604
2025-06-07 18:58:20 +05:30
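
The SYCL copy entry above bundles three small techniques: a memcpy fast path for contiguous same-type tensors, one template replacing the per-type block copy functions, and ceil_div for full element coverage. A minimal host-side sketch of the same ideas; `Tensor`, `copy_tensor`, and the `Block` parameter are illustrative stand-ins, not the actual ggml-sycl code:

```cpp
#include <cstring>
#include <cstddef>

// Illustrative stand-in; the real code works on ggml_tensor and SYCL queues.
struct Tensor {
    void * data;
    size_t nbytes;
    bool   contiguous;
};

// One generic block copy instead of per-type functions like
// cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0, ...; instantiate with
// a concrete block type (e.g. block_q8_0) where needed.
template <typename Block>
static void cpy_blck_q_q(const char * src, char * dst) {
    *reinterpret_cast<Block *>(dst) = *reinterpret_cast<const Block *>(src);
}

// Rounds up so the launched range covers every element,
// as the nd_range sizing in the commit does.
static size_t ceil_div(size_t a, size_t b) { return (a + b - 1) / b; }

static void copy_tensor(const Tensor & src, Tensor & dst) {
    if (src.contiguous && dst.contiguous && src.nbytes == dst.nbytes) {
        // Fast path: same type + contiguous layout -> plain memcpy.
        std::memcpy(dst.data, src.data, src.nbytes);
        return;
    }
    // Otherwise fall back to a per-block kernel, launched over
    // ceil_div(n_elements, block_elems) work items on the device.
}
```
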
0974ad7a7c llama : fix llama_model_chat_template with template name (LLM_KV with suffix) (#14050) b5603 2025-06-07 14:13:12 +02:00
745aa5319b llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
b5602
2025-06-06 14:11:15 +03:00
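
A sketch of how the memory API from this entry (and the llama_memory migration in #14006 further down) might be used. The accessor name and the exact signatures are assumptions inferred from the commit messages, not verified headers:

```cpp
#include "llama.h"

// Assumed usage of the llama_memory_ API that deprecates llama_kv_self_.
void reset_context_memory(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx); // assumed accessor

    // The new flag: pass false to clear metadata only,
    // true to also clear the underlying data buffers.
    llama_memory_clear(mem, /*data=*/true);
}
```
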
487a5e0401 context : fix SWA-related warning for multiple sequences (#14045) b5601 2025-06-06 13:29:18 +03:00
d17a809ef0 llama : support multiple classifier outputs and labels (#13940) b5600 2025-06-06 09:03:25 +02:00
1caae7fc6c gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels

* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
669c13e0f6 vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)
* allowing B580 and U9-288V

* experimental code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
b5598
2025-06-05 16:00:29 +02:00
146b88e8b3 ci: fix CUDA build failure on autodl cloud machines (#14005)
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-06-05 16:25:29 +03:00
7f37b6cf1e memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
b5596
2025-06-05 15:29:22 +03:00
3a077146a4 llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) b5595 2025-06-05 11:57:42 +02:00
d01d112abb readme : add badge (#13938) 2025-06-05 10:50:55 +03:00
9f47fa5792 vocab : warn about missing mask token (#14022) b5593 2025-06-05 09:29:18 +02:00
9e31bec4fd context : fix pos_min initialization upon error decode (#14008)
ggml-ci
b5592
2025-06-05 09:06:29 +03:00
5a8ae3053c vulkan: automatically deduce size of push constants (#13936) b5591 2025-06-05 07:17:58 +02:00
0d3984424f ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from an unintended position
b5590
2025-06-04 22:02:00 +02:00
3e63a58ef7 kv-cache : refactor the update/defrag mechanism (#13988)
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
b5589
2025-06-04 18:58:20 +03:00
2589ad3704 ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997) b5588 2025-06-04 15:37:40 +02:00
482548716f releases : use dl backend for linux release, remove arm64 linux release (#13996) b5587 2025-06-04 13:15:54 +02:00
3ac67535c8 llama-graph : use ggml_repeat_4d (#13998) b5586 2025-06-04 10:11:26 +02:00
0b4be4c435 CUDA: fix FTZ in FA for Gemma 3 (#13991) b5585 2025-06-04 08:57:05 +02:00
e0e806f52e kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)
ggml-ci
b5584
2025-06-04 09:50:32 +03:00
7e00e60ef8 vulkan: fix warnings in perf logger querypool code (#13937) 2025-06-03 20:30:22 +02:00
ea1431b0fa docs : add "Quick start" section for new users (#13862)
* docs : add "Quick start" section for non-technical users

* rm flox

* Update README.md
2025-06-03 13:09:36 +02:00
71e74a3ac9 opencl: add backend_synchronize (#13939)
* This is not needed for normal use, where the result is read
  using `tensor_get`, but it allows the perf mode of `test-backend-ops`
  to properly measure performance.
b5581
2025-06-02 16:54:58 -07:00
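
The note in the entry above is the standard pattern for timing asynchronous backends: without a synchronize, the timer stops after the work is merely enqueued, not finished. A sketch using the public ggml-backend API (backend and graph setup assumed, error handling omitted):

```cpp
#include "ggml-backend.h"
#include <chrono>

// Time a graph compute on an async backend (e.g. OpenCL) in milliseconds.
double time_graph_ms(ggml_backend_t backend, ggml_cgraph * graph) {
    const auto t0 = std::chrono::high_resolution_clock::now();

    ggml_backend_graph_compute_async(backend, graph);
    // Without this, we would only measure the enqueue cost.
    ggml_backend_synchronize(backend);

    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```
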
bfb1e012a0 OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes
b5580
2025-06-02 16:53:36 -07:00
3637576288 server : disable speculative decoding for SWA models (#13970)
* server : use swa-full for draft context

ggml-ci

* server : disable speculative decoding for SWA models
b5579
2025-06-02 21:34:40 +03:00
ea394d7ab1 metal : use F32 accumulators in FA kernels (#13975)
ggml-ci
b5578
2025-06-02 21:33:40 +03:00
5582c49c39 gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B

* cont : apply scale before attn

* cont : consistent attention scaling
b5577
2025-06-02 20:54:26 +03:00
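
The "apply scale before attn" step in the entry above relies on a simple identity: scaling the query vectors before the dot product equals scaling the logits after it, so the scale can be folded into Q once instead of applied per logit. In the standard case s = 1/sqrt(d_head); the Gemma variants use their own value of s, which is presumably what "consistent attention scaling" refers to:

```latex
\[
  (s\,Q)K^\top = s\,(QK^\top)
  \;\Longrightarrow\;
  \mathrm{softmax}\bigl((s\,Q)K^\top\bigr)V
    = \mathrm{softmax}\bigl(s\,QK^\top\bigr)V
\]
```
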
c9bbc77931 server: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
b5576
2025-06-02 10:15:44 -07:00
bfd322796c mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single

* mtmd-cli : fix mem leak

* Update tools/mtmd/mtmd-cli.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5575
2025-06-02 16:29:28 +02:00
093e3f1feb cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13966)
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".

This patch provides a fix by first converting the string to uppercase before applying the regex.

Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
b5574
2025-06-02 15:18:36 +03:00
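
The fix in the entry above lives in CMake, but the technique is general: normalize case before applying a case-sensitive pattern. A C++ analogue for illustration (not the repo's code):

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Extract the POWER generation (e.g. 11 from "Power11" or "POWER11")
// by uppercasing first, mirroring the string(TOUPPER ...) CMake fix.
static int power_generation(std::string cpuinfo) {
    std::transform(cpuinfo.begin(), cpuinfo.end(), cpuinfo.begin(),
                   [](unsigned char c) { return std::toupper(c); });

    std::smatch m;
    if (std::regex_search(cpuinfo, m, std::regex("POWER([0-9]+)"))) {
        return std::stoi(m[1]);
    }
    return -1; // not a POWER CPU
}
```
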
663445b0de sycl: quantize and reorder the input to q8_1 when reorder is enabled (#13826)
* [WIP]: fuse q8 quantization and reorder

* wip2: fuse q8 quantization and reorder

* working q8 reorder commit

* restored common.hpp

* remove debug prints

* remove unnecessary headers and remove trailing whitespace

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
b5573
2025-06-02 10:12:20 +01:00
7675c555a1 gguf: fix failure on version == 0 (#13956) b5572 2025-06-01 18:08:05 +02:00
5e1c3aed40 convert : fix nomic-bert-moe mask token (#13757) b5571 2025-06-01 18:07:21 +02:00
c496fe0b1d convert : fix vocab padding code for bert models (#13954) 2025-06-01 17:23:11 +02:00
e57bb87ced ggml: check if non-native endian model is being loaded (#13943)
* gguf: prevent non-native endian models from being loaded

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: update error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* gguf: make the non-native endian check more verbose

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: move ggml_assert location

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml: reword the endianness check error message

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5569
2025-06-01 16:53:57 +02:00
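
A sketch of the kind of check the endianness entry above describes: a GGUF file written on a different-endian machine has its fixed-width header fields byte-swapped, so a sanity check on the version field right after the magic can reject it early (this also relates to the "version == 0" fix above). The field layout and heuristic here are simplified assumptions, not the actual gguf loader:

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>

static uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00) | ((x << 8) & 0xff0000) | (x << 24);
}

// Simplified: read the GGUF version right after the 4-byte magic and
// reject files whose version only makes sense when byte-swapped.
void check_gguf_endianness(std::FILE * f) {
    uint32_t version = 0;
    std::fseek(f, 4, SEEK_SET); // skip "GGUF" magic
    if (std::fread(&version, sizeof(version), 1, f) != 1) {
        throw std::runtime_error("failed to read GGUF version");
    }
    if (version == 0 || (version > 0xffff && bswap32(version) <= 0xffff)) {
        throw std::runtime_error(
            "GGUF file appears to be non-native endian (byte-swapped); "
            "convert it to this machine's byte order before loading");
    }
}
```
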
f3a4b1659c sync : ggml
ggml-ci
b5568
2025-06-01 13:43:57 +03:00
108009f5c7 vulkan : Remove unexpected ; (ggml/1253) 2025-06-01 13:43:57 +03:00
d337252acf cmake : Fix broken CMake error messages (ggml/1252) 2025-06-01 13:43:57 +03:00
af6f91db47 ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation was already deleted in commit 9d0762e.

closes: #1235
2025-06-01 13:43:57 +03:00
a7b8d35f78 sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)

* sync : whisper.cpp

ggml-ci

---------

Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>
2025-06-01 13:43:57 +03:00
6eba72b71c ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends

Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
2025-06-01 13:43:57 +03:00
fedf034a98 ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-06-01 13:43:57 +03:00
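
The mechanism in the entry above can be reproduced in a few lines: install a std::terminate handler that prints a backtrace before aborting, registered once at startup (as ggml does from ggml_init). A minimal POSIX/glibc sketch using execinfo; the real ggml version is more careful:

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <execinfo.h>

// Print the current call stack using glibc's backtrace facilities.
static void print_backtrace() {
    void * frames[64];
    const int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, 2); // write symbolized frames to stderr
}

static void terminate_with_backtrace() {
    std::fprintf(stderr, "uncaught exception, backtrace:\n");
    print_backtrace();
    std::abort();
}

// Installed via a static object's constructor so it runs before main,
// analogous to registering the handler from an init function.
struct backtrace_installer {
    backtrace_installer() { std::set_terminate(terminate_with_backtrace); }
} static const g_backtrace_installer;
```
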
8726392d3d readme : update bindings (#13950) 2025-06-01 11:44:30 +03:00
c04621711a parallel : fix n_junk == 0 (#13952) b5560 2025-06-01 11:42:16 +03:00
0fc16b42e8 kv-cache : split implementation in separate sources (#13920)
ggml-ci
b5559
2025-06-01 11:39:27 +03:00