1f63e75f3b
metal : use less stack memory in FA kernel ( #14088 )
...
* metal : use less stack memory in FA kernel
ggml-ci
* cont : fix BF16 variant
b5618
2025-06-09 23:05:02 +03:00
40cbf571c9
kv-cache : fix shift and defrag logic ( #14081 )
...
* kv-cache : fix shift
ggml-ci
* cont : reset shift[i]
ggml-ci
* cont : fix defrag erasing cells that didn't move
ggml-ci
b5617
2025-06-09 23:04:35 +03:00
7f4fbe5183
llama : allow building all tests on windows when not using shared libs ( #13980 )
...
* llama : allow building all tests on windows when not using shared libraries
* add static windows build to ci
* tests : enable debug logs for test-chat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
b5616
2025-06-09 20:03:09 +02:00
f470bc36be
ggml-cpu : split arch-specific implementations ( #13892 )
...
* move ggml-cpu-aarch64 to repack
* split quantize_row_q8_0/1
* split helper functions
* split ggml_vec_dot_q4_0_q8_0
* split ggml_vec_dot_q4_1_q8_1
* split ggml_vec_dot_q5_0_q8_0
* split ggml_vec_dot_q5_1_q8_1
* split ggml_vec_dot_q8_0_q8_0
* split ggml_vec_dot_tq1_0_q8_K
* split ggml_vec_dot_tq2_0_q8_K
* split ggml_vec_dot_q2_K_q8_K
* split ggml_vec_dot_q3_K_q8_K
* split ggml_vec_dot_q4_K_q8_K
* split ggml_vec_dot_q5_K_q8_K
* split ggml_vec_dot_q6_K_q8_K
* split ggml_vec_dot_iq2_xxs_q8_K
* split ggml_vec_dot_iq2_xs_q8_K
* split ggml_vec_dot_iq2_s_q8_K
* split ggml_vec_dot_iq3_xxs_q8_K
* split ggml_vec_dot_iq3_s_q8_K
* split ggml_vec_dot_iq1_s_q8_K
* split ggml_vec_dot_iq1_m_q8_K
* split ggml_vec_dot_iq4_nl_q8_0
* split ggml_vec_dot_iq4_xs_q8_K
* fix typos
* fix missing prototypes
* rename ggml-cpu-quants.c
* rename ggml-cpu-traits
* rename arm folder
* move cpu-feats-x86.cpp
* rename ggml-cpu-hbm
* update arm detection macro in quants.c
* move iq quant tables
* split ggml_quantize_mat_q8_0/K
* split ggml_gemv_*
* split ggml_gemm_*
* rename namespace aarch64 to repack
* use weak aliases to replace test macros
* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
* rename more aarch64 to repack
* clean up rebase leftover
* fix compilation errors
* remove trailing spaces
* try to fix clang compilation errors
* try to fix clang compilation errors again
* try to fix clang compilation errors, 3rd attempt
* try to fix clang compilation errors, 4th attempt
* try to fix clang compilation errors, 5th attempt
* try to fix clang compilation errors, 6th attempt
* try to fix clang compilation errors, 7th attempt
* try to fix clang compilation errors, 8th attempt
* try to fix clang compilation errors, 9th attempt
* more cleanup
* fix compilation errors
* fix apple targets
* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
b5615
2025-06-09 16:47:13 +02:00
8f47e25f56
cuda : fix device sync on buffer clear ( #14033 )
b5614
2025-06-09 16:36:26 +02:00
201b31dc2e
graph : fix geglu ( #14077 )
...
ggml-ci
b5613
2025-06-09 17:17:31 +03:00
e21d2d4ae2
CANN: Simplify the environment variable setting ( #13104 )
...
* Simplify the environment variable setting to specify the memory pool type.
* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
* update
* fix CI
* update
* delete whitespace
* fix according to review
* update CANN.md
* update CANN.md
b5612
2025-06-09 19:47:39 +08:00
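The case-insensitive matching described for `GGML_CANN_ASYNC_MODE` in the entry above can be illustrated with a minimal sketch. This is not the CANN backend code; the helper name `env_flag_enabled` and the `environ` parameter are hypothetical, and only the four accepted values come from the commit message.

```python
import os

# Accepted "enabled" values, taken from the commit message above.
TRUTHY = {"yes", "enable", "1", "on"}


def env_flag_enabled(name, environ=None):
    """Return True if the variable holds one of the accepted values.

    `environ` is a hypothetical parameter that lets callers pass a plain
    dict instead of the real process environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    # Lowercasing makes the comparison case-insensitive.
    return value.lower() in TRUTHY


print(env_flag_enabled("GGML_CANN_ASYNC_MODE", {"GGML_CANN_ASYNC_MODE": "Enable"}))
```

Any other value, including an unset variable, is treated as disabled in this sketch.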
dc0623fddb
webui: fix sidebar being covered by main content ( #14082 )
...
* webui: fix sidebar being covered by main content
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
* webui: update index.html.gz
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com >
2025-06-09 12:01:17 +02:00
87d34b381d
server : fix LRU check ( #14079 )
...
ggml-ci
b5610
2025-06-09 12:57:58 +03:00
b460d16ae8
sycl: Add reorder to Q6_K mmvq implementation ( #13885 )
...
* Add Reorder to Q6_K mmvq implementation
* Address PR comments: clean up comments
* Remove unused parameter after refactoring q4_k
* Adding inline to function and removing unnecessary reference to int
---------
Signed-off-by: nscipione <nicolo.scipione@codeplay.com >
b5609
2025-06-09 11:47:07 +02:00
91a8ee6a6f
add geglu activation function ( #14074 )
...
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp >
b5608
2025-06-09 05:15:31 +01:00
056eb74534
CANN: Enable labeler for Ascend NPU ( #13914 )
2025-06-09 11:20:06 +08:00
247e5c6e44
cuda : fix buffer type check with integrated GPUs ( #14069 )
b5606
2025-06-08 11:39:56 -07:00
5787b5da57
ci: add LoongArch cross-compile build ( #13944 )
2025-06-07 10:39:11 -03:00
228f34c9ce
SYCL: Implement few same quantized type copy kernels ( #13739 )
...
* SYCL: Implement few same quantized type copy kernels
* Use memcpy for copying contiguous tensors
ggml-ci
* feat(sycl): add contiguous tensor copy support and device checks
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
* refactor: replace specific block copy functions with template
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
* Exclude BF16 support for COPY tensors for now
ggml-ci
* perf: adjust SYCL copy kernel block sizes for efficiency
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
b5604
2025-06-07 18:58:20 +05:30
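The `ceil_div` adjustment mentioned in the SYCL entry above rounds the number of work-groups up so that a trailing partial block is still covered. A rough sketch of the arithmetic follows; `launch_size` is a hypothetical helper and does not mirror the actual kernel launch code.

```python
def ceil_div(n, d):
    """Integer division rounded up, for positive d."""
    return (n + d - 1) // d


def launch_size(num_elements, block_size):
    """Return (global_size, local_size) covering every element.

    Hypothetical helper: the global size is padded up to a multiple of
    the block size, so a kernel would need a bounds check for the tail.
    """
    num_blocks = ceil_div(num_elements, block_size)
    return num_blocks * block_size, block_size


print(launch_size(10, 4))
```

With plain floor division, 10 elements and a block size of 4 would give only 2 blocks and leave the last 2 elements uncopied.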
0974ad7a7c
llama : fix llama_model_chat_template with template name (LLM_KV with suffix) ( #14050 )
b5603
2025-06-07 14:13:12 +02:00
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
b5602
2025-06-06 14:11:15 +03:00
487a5e0401
context : fix SWA-related warning for multiple sequences ( #14045 )
b5601
2025-06-06 13:29:18 +03:00
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
b5600
2025-06-06 09:03:25 +02:00
1caae7fc6c
gguf-py : add add_classifier_output_labels method to writer ( #14031 )
...
* add add_classifier_output_labels
* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
669c13e0f6
vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs ( #14001 )
...
* allowing B580 and U9-288V
* experimenting code to detect Xe2
* allowing coopmat only for Xe2 GPUs
* fixed comment wording
* fixed comment wording
* removed unnecessary driver check
b5598
2025-06-05 16:00:29 +02:00
146b88e8b3
ci: fix CUDA build failure on autodl cloud machines ( #14005 )
...
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.
Co-authored-by: pockers21 <liyang2@uniontech.com >
2025-06-05 16:25:29 +03:00
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
b5596
2025-06-05 15:29:22 +03:00
3a077146a4
llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources ( #14013 )
b5595
2025-06-05 11:57:42 +02:00
d01d112abb
readme : add badge ( #13938 )
2025-06-05 10:50:55 +03:00
9f47fa5792
vocab : warn about missing mask token ( #14022 )
b5593
2025-06-05 09:29:18 +02:00
9e31bec4fd
context : fix pos_min initialization upon error decode ( #14008 )
...
ggml-ci
b5592
2025-06-05 09:06:29 +03:00
5a8ae3053c
vulkan: automatically deduce size of push constants ( #13936 )
b5591
2025-06-05 07:17:58 +02:00
0d3984424f
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D ( #13813 )
...
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader.
Number of additional tests reduced to 108.
* * Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
b5590
2025-06-04 22:02:00 +02:00
3e63a58ef7
kv-cache : refactor the update/defrag mechanism ( #13988 )
...
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
b5589
2025-06-04 18:58:20 +03:00
2589ad3704
ci : remove cuda 11.7 releases, switch runner to windows 2022 ( #13997 )
b5588
2025-06-04 15:37:40 +02:00
482548716f
releases : use dl backend for linux release, remove arm64 linux release ( #13996 )
b5587
2025-06-04 13:15:54 +02:00
3ac67535c8
llama-graph : use ggml_repeat_4d ( #13998 )
b5586
2025-06-04 10:11:26 +02:00
0b4be4c435
CUDA: fix FTZ in FA for Gemma 3 ( #13991 )
b5585
2025-06-04 08:57:05 +02:00
e0e806f52e
kv-cache : fix unified::seq_rm to work with seq_id < 0 ( #13985 )
...
ggml-ci
b5584
2025-06-04 09:50:32 +03:00
7e00e60ef8
vulkan: fix warnings in perf logger querypool code ( #13937 )
2025-06-03 20:30:22 +02:00
ea1431b0fa
docs : add "Quick start" section for new users ( #13862 )
...
* docs : add "Quick start" section for non-technical users
* rm flox
* Update README.md
2025-06-03 13:09:36 +02:00
71e74a3ac9
opencl: add backend_synchronize ( #13939 )
...
* This is not needed by the normal use where the result is read
using `tensor_get`, but it allows perf mode of `test-backend-ops`
to properly measure performance.
b5581
2025-06-02 16:54:58 -07:00
bfb1e012a0
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat ( #13840 )
...
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
b5580
2025-06-02 16:53:36 -07:00
3637576288
server : disable speculative decoding for SWA models ( #13970 )
...
* server : use swa-full for draft context
ggml-ci
* server : disable speculative decoding for SWA models
b5579
2025-06-02 21:34:40 +03:00
ea394d7ab1
metal : use F32 accumulators in FA kernels ( #13975 )
...
ggml-ci
b5578
2025-06-02 21:33:40 +03:00
5582c49c39
gemma : more consistent attention scaling for v2 and v3 ( #13951 )
...
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
b5577
2025-06-02 20:54:26 +03:00
c9bbc77931
server : update deepseek reasoning format (pass reasoning_content as diffs) ( #13933 )
...
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
b5576
2025-06-02 10:15:44 -07:00
bfd322796c
mtmd : fix memory leak in mtmd_helper_eval_chunk_single ( #13961 )
...
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
b5575
2025-06-02 16:29:28 +02:00
093e3f1feb
cmake : Handle mixed-case 'Power' strings in POWER CPU detection ( #13966 )
...
Some systems report the CPU implementation as "Power11" instead of "POWER11".
The existing CMake logic uses a case-sensitive regular expression to extract
the CPU generation, which fails when the casing doesn't exactly match "POWER".
This patch provides a fix by first converting the string to uppercase before applying the regex.
Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com >
Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com >
b5574
2025-06-02 15:18:36 +03:00
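The fix described in the POWER detection entry above (uppercase the string first, then apply the case-sensitive regex) can be sketched outside CMake as follows. The function name `power_generation` and the exact pattern are assumptions for illustration; the real change lives in the CMake logic.

```python
import re

# Case-sensitive pattern, standing in for the one the commit describes.
POWER_RE = re.compile(r"POWER([0-9]+)")


def power_generation(cpu_string):
    """Extract the POWER generation number, or None if not found."""
    # Normalizing to uppercase lets "Power11" match the same pattern
    # that already handled "POWER11".
    match = POWER_RE.search(cpu_string.upper())
    return int(match.group(1)) if match else None


print(power_generation("Power11"))
```

Without the `upper()` call, `POWER_RE.search("Power11")` finds nothing, which is the failure the commit message reports.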
663445b0de
sycl: quantize and reorder the input to q8_1 when reorder is enabled ( #13826 )
...
* [WIP]: fuse q8 quantization and reorder
* wip2: fuse q8 quantization and reorder
* working q8 reorder commit
* restored common.hpp
* remove debug prints
* remove unnecessary headers and remove trailing whitespace
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com >
---------
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com >
b5573
2025-06-02 10:12:20 +01:00
7675c555a1
gguf: fix failure on version == 0 ( #13956 )
b5572
2025-06-01 18:08:05 +02:00
5e1c3aed40
convert : fix nomic-bert-moe mask token ( #13757 )
b5571
2025-06-01 18:07:21 +02:00
c496fe0b1d
convert : fix vocab padding code for bert models ( #13954 )
2025-06-01 17:23:11 +02:00
e57bb87ced
ggml: check if non-native endian model is being loaded ( #13943 )
...
* gguf: prevent non-native endian models from being loaded
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* gguf: update error message
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* gguf: make the non-native endian check more verbose
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: move ggml_assert location
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: reword the endianness check error message
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
b5569
2025-06-01 16:53:57 +02:00
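One way a loader can notice a non-native endian file, as the entry above describes, is to look at a small integer header field read in native byte order. The sketch below is a heuristic under stated assumptions, not the ggml implementation: it assumes a 32-bit version field whose valid values are small (below 65536), and `looks_byte_swapped` is a hypothetical name.

```python
import struct
import sys


def looks_byte_swapped(version_bytes):
    """Guess whether a 4-byte version field was written in the other byte order.

    Assumption: real version numbers are small, so a nonzero value whose
    low 16 bits are all zero is most likely a byte-swapped small integer.
    """
    (version,) = struct.unpack("=I", version_bytes)  # "=" means native order
    return version != 0 and (version & 0xFFFF) == 0


# Pack version 3 in the byte order opposite to this machine's.
other = ">I" if sys.byteorder == "little" else "<I"
print(looks_byte_swapped(struct.pack(other, 3)))
```

A loader using such a check would typically refuse the file with an explicit endianness error instead of failing later on nonsensical sizes.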