91a8ee6a6f
add geglu activation function ( #14074 )
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5608
2025-06-09 05:15:31 +01:00
056eb74534
CANN: Enable labeler for Ascend NPU ( #13914 )
2025-06-09 11:20:06 +08:00
247e5c6e44
cuda : fix buffer type check with integrated GPUs ( #14069 )
b5606
2025-06-08 11:39:56 -07:00
5787b5da57
ci: add LoongArch cross-compile build ( #13944 )
2025-06-07 10:39:11 -03:00
228f34c9ce
SYCL: Implement a few same-quantized-type copy kernels ( #13739 )
* SYCL: Implement a few same-quantized-type copy kernels
* Use memcpy for copying contiguous tensors
ggml-ci
* feat(sycl): add contiguous tensor copy support and device checks
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
* refactor: replace specific block copy functions with template
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
* Exclude BF16 support for COPY tensors for now
ggml-ci
* perf: adjust SYCL copy kernel block sizes for efficiency
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
b5604
2025-06-07 18:58:20 +05:30
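A minimal sketch of the ideas described above, under illustrative names (this is not the actual SYCL source): same-type quantized blocks copy as raw bytes through a single template instead of per-type functions, contiguous tensors take a plain memcpy fast path, and ceil_div rounds the launch grid up so every element is covered:

    #include <cstring>

    // one template replaces cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0, ...
    template <typename block_t>
    static void cpy_blck_q_q(const char * src, char * dst) {
        std::memcpy(dst, src, sizeof(block_t)); // same type: raw byte copy
    }

    // fast path: contiguous tensors of the same type need no kernel at all
    static void cpy_contiguous(const void * src, void * dst, size_t nbytes) {
        std::memcpy(dst, src, nbytes);
    }

    // round up so partial blocks at the end are not skipped
    static int ceil_div(int a, int b) {
        return (a + b - 1) / b;
    }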
0974ad7a7c
llama : fix llama_model_chat_template with template name (LLM_KV with suffix) ( #14050 )
b5603
2025-06-07 14:13:12 +02:00
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
b5602
2025-06-06 14:11:15 +03:00
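A hedged sketch of the migration from the caller's side (check llama.h for the exact signatures; the flag name here is illustrative): the deprecated llama_kv_self_* calls are replaced by the llama_memory_* API, and the new flag on llama_memory_clear chooses between wiping the data buffers and only resetting metadata:

    #include "llama.h"

    void reset_context_memory(llama_context * ctx, bool clear_data) {
        llama_memory_t mem = llama_get_memory(ctx);

        // old: llama_kv_self_clear(ctx);
        // clear_data = true  -> also wipe the data buffers
        // clear_data = false -> reset metadata only (cheaper)
        llama_memory_clear(mem, clear_data);
    }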
487a5e0401
context : fix SWA-related warning for multiple sequences ( #14045 )
b5601
2025-06-06 13:29:18 +03:00
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
b5600
2025-06-06 09:03:25 +02:00
1caae7fc6c
gguf-py : add add_classifier_output_labels method to writer ( #14031 )
* add add_classifier_output_labels
* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
669c13e0f6
vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs ( #14001 )
* allowing B580 and U9-288V
* experimental code to detect Xe2
* allowing coopmat only for Xe2 GPUs
* fixed comment wording
* fixed comment wording
* removed unnecessary driver check
b5598
2025-06-05 16:00:29 +02:00
146b88e8b3
ci: fix CUDA build failure on autodl cloud machines ( #14005 )
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.
Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-06-05 16:25:29 +03:00
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
b5596
2025-06-05 15:29:22 +03:00
3a077146a4
llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources ( #14013 )
b5595
2025-06-05 11:57:42 +02:00
d01d112abb
readme : add badge ( #13938 )
2025-06-05 10:50:55 +03:00
9f47fa5792
vocab : warn about missing mask token ( #14022 )
b5593
2025-06-05 09:29:18 +02:00
9e31bec4fd
context : fix pos_min initialization upon error decode ( #14008 )
ggml-ci
b5592
2025-06-05 09:06:29 +03:00
5a8ae3053c
vulkan: automatically deduce size of push constants ( #13936 )
b5591
2025-06-05 07:17:58 +02:00
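The commit above removes hand-maintained push constant byte counts; the general technique, shown here as a hedged C++ illustration rather than the actual backend code, is to deduce the size from the struct type:

    #include <type_traits>
    #include <vulkan/vulkan.h>

    template <typename PC>
    void set_push_constants(VkCommandBuffer cmd, VkPipelineLayout layout, const PC & pc) {
        static_assert(std::is_trivially_copyable_v<PC>, "push constants must be POD");
        // sizeof(PC) is deduced from the argument - no separate size to keep in sync
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(PC), &pc);
    }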
0d3984424f
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D ( #13813 )
* ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader.
Number of additional tests reduced to 108.
* Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
b5590
2025-06-04 22:02:00 +02:00
3e63a58ef7
kv-cache : refactor the update/defrag mechanism ( #13988 )
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
b5589
2025-06-04 18:58:20 +03:00
2589ad3704
ci : remove cuda 11.7 releases, switch runner to windows 2022 ( #13997 )
b5588
2025-06-04 15:37:40 +02:00
482548716f
releases : use dl backend for linux release, remove arm64 linux release ( #13996 )
b5587
2025-06-04 13:15:54 +02:00
3ac67535c8
llama-graph : use ggml_repeat_4d ( #13998 )
b5586
2025-06-04 10:11:26 +02:00
0b4be4c435
CUDA: fix FTZ in FA for Gemma 3 ( #13991 )
b5585
2025-06-04 08:57:05 +02:00
e0e806f52e
kv-cache : fix unified::seq_rm to work with seq_id < 0 ( #13985 )
ggml-ci
b5584
2025-06-04 09:50:32 +03:00
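A hedged, simplified sketch of the seq_rm semantics being fixed above (not the actual unified cache code): seq_id < 0 acts as a wildcard matching every sequence, and negative positions extend the range to the start or end:

    #include <limits>
    #include <set>
    #include <vector>

    struct kv_cell {
        int pos = -1;            // -1 means the cell is free
        std::set<int> seq_ids;   // sequences referencing this cell
    };

    void seq_rm(std::vector<kv_cell> & cells, int seq_id, int p0, int p1) {
        if (p0 < 0) p0 = 0;                                  // from the start
        if (p1 < 0) p1 = std::numeric_limits<int>::max();    // to the end

        for (auto & cell : cells) {
            if (cell.pos < p0 || cell.pos >= p1) continue;
            if (seq_id < 0) cell.seq_ids.clear();            // wildcard: all sequences
            else            cell.seq_ids.erase(seq_id);
            if (cell.seq_ids.empty()) cell.pos = -1;         // cell becomes free
        }
    }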
7e00e60ef8
vulkan: fix warnings in perf logger querypool code ( #13937 )
2025-06-03 20:30:22 +02:00
ea1431b0fa
docs : add "Quick start" section for new users ( #13862 )
* docs : add "Quick start" section for non-technical users
* rm flox
* Update README.md
2025-06-03 13:09:36 +02:00
71e74a3ac9
opencl: add backend_synchronize ( #13939 )
* This is not needed in normal use, where the result is read
using `tensor_get`, but it allows the perf mode of `test-backend-ops`
to properly measure performance.
b5581
2025-06-02 16:54:58 -07:00
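A hedged sketch of what such a synchronize hook does (the context struct and function names are illustrative, not the actual OpenCL backend): block the host until every queued kernel has finished, so timing loops measure real device execution:

    #include <CL/cl.h>

    struct opencl_backend_ctx { cl_command_queue queue; }; // illustrative

    static void backend_synchronize(opencl_backend_ctx * ctx) {
        clFinish(ctx->queue); // returns only after all enqueued work completes
    }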
bfb1e012a0
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat ( #13840 )
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
b5580
2025-06-02 16:53:36 -07:00
3637576288
server : disable speculative decoding for SWA models ( #13970 )
* server : use swa-full for draft context
ggml-ci
* server : disable speculative decoding for SWA models
b5579
2025-06-02 21:34:40 +03:00
ea394d7ab1
metal : use F32 accumulators in FA kernels ( #13975 )
ggml-ci
b5578
2025-06-02 21:33:40 +03:00
5582c49c39
gemma : more consistent attention scaling for v2 and v3 ( #13951 )
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
b5577
2025-06-02 20:54:26 +03:00
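A hedged sketch of "apply scale before attn" in ggml terms (simplified; the real graph also handles masking and logit softcapping): scale Q once before the matmul instead of scaling the score matrix afterwards, keeping the scaling consistent across model versions:

    #include "ggml.h"

    ggml_tensor * attn_scores(ggml_context * ctx, ggml_tensor * k, ggml_tensor * q,
                              float kq_scale) {
        q = ggml_scale(ctx, q, kq_scale); // scale applied before the attention matmul
        return ggml_mul_mat(ctx, k, q);   // scores = K^T (q * scale)
    }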
c9bbc77931
server : update deepseek reasoning format (pass reasoning_content as diffs) ( #13933 )
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
b5576
2025-06-02 10:15:44 -07:00
bfd322796c
mtmd : fix memory leak in mtmd_helper_eval_chunk_single ( #13961 )
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5575
2025-06-02 16:29:28 +02:00
093e3f1feb
cmake : Handle mixed-case 'Power' strings in POWER CPU detection ( #13966 )
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".
This patch provides a fix by first converting the string to uppercase before applying the regex.
Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com>
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>
b5574
2025-06-02 15:18:36 +03:00
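The fix itself lives in CMake, but the idea is generic: normalize case before matching. The same approach expressed in C++, purely for illustration (the function name and parsing are mine, not the build code):

    #include <algorithm>
    #include <cctype>
    #include <string>

    // returns the POWER generation, e.g. 11 for "Power11" or "POWER11"
    int power_generation(std::string impl) {
        std::transform(impl.begin(), impl.end(), impl.begin(),
                       [](unsigned char c) { return std::toupper(c); });
        const auto pos = impl.find("POWER");
        return pos == std::string::npos ? -1 : std::stoi(impl.substr(pos + 5));
    }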
663445b0de
sycl: quantize and reorder the input to q8_1 when reorder is enabled ( #13826 )
* [WIP]: fuse q8 quantization and reorder
* wip2: fuse q8 quantization and reorder
* working q8 reorder commit
* restored common.hpp
* remove debug prints
* remove unnecessary headers and remove trailing whitespace
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
---------
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>
b5573
2025-06-02 10:12:20 +01:00
7675c555a1
gguf: fix failure on version == 0 ( #13956 )
b5572
2025-06-01 18:08:05 +02:00
5e1c3aed40
convert : fix nomic-bert-moe mask token ( #13757 )
b5571
2025-06-01 18:07:21 +02:00
c496fe0b1d
convert : fix vocab padding code for bert models ( #13954 )
2025-06-01 17:23:11 +02:00
e57bb87ced
ggml: check if non-native endian model is being loaded ( #13943 )
* gguf: prevent non-native endian models from being loaded
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* gguf: update error message
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* gguf: make the non-native endian check more verbose
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: move ggml_assert location
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml: reword the endianness check error message
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b5569
2025-06-01 16:53:57 +02:00
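A hedged sketch of how a byte-swapped GGUF file can be detected (simplified; the real check lives in the gguf loader): the little-endian version field of a non-native file decodes to an implausibly large number whose byte-swapped value is a plausible small one:

    #include <cstdint>
    #include <cstdio>

    bool gguf_version_is_native(uint32_t version) {
        const uint32_t swapped = __builtin_bswap32(version); // GCC/Clang builtin
        if (version == 0 || (swapped != 0 && swapped < version)) {
            fprintf(stderr,
                    "gguf: model appears to be non-native endian "
                    "(version = %u, byte-swapped = %u)\n", version, swapped);
            return false;
        }
        return true;
    }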
f3a4b1659c
sync : ggml
ggml-ci
b5568
2025-06-01 13:43:57 +03:00
108009f5c7
vulkan : Remove unexpected ; (ggml/1253)
2025-06-01 13:43:57 +03:00
d337252acf
cmake : Fix broken CMake error messages (ggml/1252)
2025-06-01 13:43:57 +03:00
af6f91db47
ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation was already removed in commit 9d0762e.
closes : #1235
2025-06-01 13:43:57 +03:00
a7b8d35f78
sync : whisper.cpp (ggml/1250)
* ggml : Fix backtrace breaking Windows build (whisper/3203)
* sync : whisper.cpp
ggml-ci
---------
Co-authored-by: Daniel Tang <danielzgtg.opensource@gmail.com>
2025-06-01 13:43:57 +03:00
6eba72b71c
ggml : install dynamic backends (ggml/1240)
* ggml : install dynamic backends
Make sure dynamic backends are installed in $CMAKE_INSTALL_BINDIR
2025-06-01 13:43:57 +03:00
fedf034a98
ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.
This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-06-01 13:43:57 +03:00
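A hedged sketch of the registration mechanism described above (the real handler prints an actual backtrace through ggml's platform helpers; this only shows the hook): install a std::terminate handler once at init time so uncaught C++ exceptions end with diagnostics in the log:

    #include <cstdio>
    #include <cstdlib>
    #include <exception>

    static void on_terminate() {
        fprintf(stderr, "uncaught exception - backtrace would be printed here\n");
        std::abort(); // preserve the default abort behavior
    }

    void install_exception_backtrace() { // e.g. called once from ggml_init
        std::set_terminate(on_terminate);
    }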
8726392d3d
readme : update bindings ( #13950 )
2025-06-01 11:44:30 +03:00
c04621711a
parallel : fix n_junk == 0 ( #13952 )
b5560
2025-06-01 11:42:16 +03:00
0fc16b42e8
kv-cache : split implementation in separate sources ( #13920 )
ggml-ci
b5559
2025-06-01 11:39:27 +03:00