Commit Graph

5678 Commits

Author SHA1 Message Date
55f6b9fa65 convert : fix duplicate key DeepSeek-R1 conversion error (#14103) 2025-06-10 23:29:52 +02:00
3678b838bb llama : support GEGLU for jina-bert-v2 (#14090) b5627 2025-06-10 18:02:08 +02:00
652b70e667 vulkan: force device 0 in CI (#14106) 2025-06-10 10:53:47 -05:00
3a12db23b6 Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104) b5625 2025-06-10 16:48:07 +01:00
ae92c1855b sync : ggml
ggml-ci
b5624
2025-06-10 18:39:33 +03:00
b7ce1ad1e3 ggml : fix weak alias win32 (whisper/0)
ggml-ci
2025-06-10 18:39:33 +03:00
97340b4c99 Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099) b5622 2025-06-10 13:01:33 +01:00
2bb0467043 rpc : nicer error messages for RPC server crash (#14076) b5621 2025-06-10 09:41:01 +03:00
b8e2194efc sync : ggml
ggml-ci
b5620
2025-06-10 09:21:56 +03:00
1a3b5e80f7 Add in-build ggml::ggml ALIAS library (ggml/1260)
Enable uniform linking with subproject and with find_package.
2025-06-10 09:21:56 +03:00
1f63e75f3b metal : use less stack memory in FA kernel (#14088)
* metal : use less stack memory in FA kernel

ggml-ci

* cont : fix BF16 variant
b5618
2025-06-09 23:05:02 +03:00
40cbf571c9 kv-cache : fix shift and defrag logic (#14081)
* kv-cache : fix shift

ggml-ci

* cont : reset shift[i]

ggml-ci

* cont : fix defrag erasing cells that didn't move

ggml-ci
b5617
2025-06-09 23:04:35 +03:00
7f4fbe5183 llama : allow building all tests on windows when not using shared libs (#13980)
* llama : allow building all tests on windows when not using shared libraries

* add static windows build to ci

* tests : enable debug logs for test-chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5616
2025-06-09 20:03:09 +02:00
f470bc36be ggml-cpu : split arch-specific implementations (#13892)
* move ggml-cpu-aarch64 to repack

* split quantize_row_q8_0/1

* split helper functions

* split ggml_vec_dot_q4_0_q8_0

* split ggml_vec_dot_q4_1_q8_1

* split ggml_vec_dot_q5_0_q8_0

* split ggml_vec_dot_q5_1_q8_1

* split ggml_vec_dot_q8_0_q8_0

* split ggml_vec_dot_tq1_0_q8_K

* split ggml_vec_dot_tq2_0_q8_K

* split ggml_vec_dot_q2_K_q8_K

* split ggml_vec_dot_q3_K_q8_K

* split ggml_vec_dot_q4_K_q8_K

* split ggml_vec_dot_q5_K_q8_K

* split ggml_vec_dot_q6_K_q8_K

* split ggml_vec_dot_iq2_xxs_q8_K

* split ggml_vec_dot_iq2_xs_q8_K

* split ggml_vec_dot_iq2_s_q8_K

* split ggml_vec_dot_iq3_xxs_q8_K

* split ggml_vec_dot_iq3_s_q8_K

* split ggml_vec_dot_iq1_s_q8_K

* split ggml_vec_dot_iq1_m_q8_K

* split ggml_vec_dot_iq4_nl_q8_0

* split ggml_vec_dot_iq4_xs_q8_K

* fix typos

* fix missing prototypes

* rename ggml-cpu-quants.c

* rename ggml-cpu-traits

* rename arm folder

* move cpu-feats-x86.cpp

* rename ggml-cpu-hbm

* update arm detection macro in quants.c

* move iq quant tables

* split ggml_quantize_mat_q8_0/K

* split ggml_gemv_*

* split ggml_gemm_*

* rename namespace aarch64 to repack

* use weak aliases to replace test macros

* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK

* rename more aarch64 to repack

* clean up rebase leftover

* fix compilation errors

* remove trailing spaces

* try to fix clang compilation errors

* try to fix clang compilation errors again

* try to fix clang compilation errors, 3rd attempt

* try to fix clang compilation errors, 4th attempt

* try to fix clang compilation errors, 5th attempt

* try to fix clang compilation errors, 6th attempt

* try to fix clang compilation errors, 7th attempt

* try to fix clang compilation errors, 8th attempt

* try to fix clang compilation errors, 9th attempt

* more cleanup

* fix compilation errors

* fix apple targets

* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b5615
2025-06-09 16:47:13 +02:00
8f47e25f56 cuda : fix device sync on buffer clear (#14033) b5614 2025-06-09 16:36:26 +02:00
201b31dc2e graph : fix geglu (#14077)
ggml-ci
b5613
2025-06-09 17:17:31 +03:00
e21d2d4ae2 CANN: Simplify the environment variable setting(#13104)
* Simplify the environment variable setting to specify the memory pool type.

* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.

* update

* fix CI

* update

* delete whitespace

* fix according to review

* update CANN.md

* update CANN.md
b5612
2025-06-09 19:47:39 +08:00
dc0623fddb webui: fix sidebar being covered by main content (#14082)
* webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* webui: update index.html.gz

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-09 12:01:17 +02:00
87d34b381d server : fix LRU check (#14079)
ggml-ci
b5610
2025-06-09 12:57:58 +03:00
b460d16ae8 sycl: Add reorder to Q6_K mmvq implementation (#13885)
* Add Reorder to Q6_K mmvq implementation

* Address PR comments: clean up comments

* Remove unused parameter after refactoring q4_k

* Adding inline to function and removing unnecessary reference to int

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5609
2025-06-09 11:47:07 +02:00
91a8ee6a6f add geglu activation function (#14074)
Co-authored-by: dinhhuy <huy.dinh@brains-tech.co.jp>
b5608
2025-06-09 05:15:31 +01:00
056eb74534 CANN: Enable labeler for Ascend NPU (#13914) 2025-06-09 11:20:06 +08:00
247e5c6e44 cuda : fix buffer type check with integrated GPUs (#14069) b5606 2025-06-08 11:39:56 -07:00
5787b5da57 ci: add LoongArch cross-compile build (#13944) 2025-06-07 10:39:11 -03:00
228f34c9ce SYCL: Implement few same quantized type copy kernels (#13739)
* SYCL: Implement few same quantized type copy kernels

* Use memcpy for copying contiguous tensors

ggml-ci

* feat(sycl): add contiguous tensor copy support and device checks

Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.

* refactor: replace specific block copy functions with template

The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.

* Exclude BF16 support for COPY tensors for now
ggml-ci

* perf: adjust SYCL copy kernel block sizes for efficiency

Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
b5604
2025-06-07 18:58:20 +05:30
0974ad7a7c llama : fix llama_model_chat_template with template name (LLM_KV with suffix) (#14050) b5603 2025-06-07 14:13:12 +02:00
745aa5319b llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API

ggml-ci

* llama : allow llama_memory_(nullptr)

ggml-ci

* memory : add flag for optional data clear in llama_memory_clear

ggml-ci
b5602
2025-06-06 14:11:15 +03:00
487a5e0401 context : fix SWA-related warning for multiple sequences (#14045) b5601 2025-06-06 13:29:18 +03:00
d17a809ef0 llama : support multiple classifier outputs and labels (#13940) b5600 2025-06-06 09:03:25 +02:00
1caae7fc6c gguf-py : add add_classifier_output_labels method to writer (#14031)
* add add_classifier_output_labels

* use add_classifier_output_labels
2025-06-05 17:42:31 +02:00
669c13e0f6 vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001)
* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
b5598
2025-06-05 16:00:29 +02:00
146b88e8b3 ci: fix CUDA build failure on autodl cloud machines (#14005)
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection
as 'native' fails on autodl cloud environments.

Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-06-05 16:25:29 +03:00
7f37b6cf1e memory : migrate from llama_kv_cache to more generic llama_memory (#14006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API

ggml-ci

* context : fix casts

ggml-ci
b5596
2025-06-05 15:29:22 +03:00
3a077146a4 llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013) b5595 2025-06-05 11:57:42 +02:00
d01d112abb readme : add badge (#13938) 2025-06-05 10:50:55 +03:00
9f47fa5792 vocab : warn about missing mask token (#14022) b5593 2025-06-05 09:29:18 +02:00
9e31bec4fd context : fix pos_min initialization upon error decode (#14008)
ggml-ci
b5592
2025-06-05 09:06:29 +03:00
5a8ae3053c vulkan: automatically deduce size of push constants (#13936) b5591 2025-06-05 07:17:58 +02:00
0d3984424f ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* * Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position
b5590
2025-06-04 22:02:00 +02:00
3e63a58ef7 kv-cache : refactor the update/defrag mechanism (#13988)
* kv-cache : refactor update mechanism

ggml-ci

* memory : improve status handling

* defrag : reset head + add comments

ggml-ci

* cont : minor fixes

ggml-ci
b5589
2025-06-04 18:58:20 +03:00
2589ad3704 ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997) b5588 2025-06-04 15:37:40 +02:00
482548716f releases : use dl backend for linux release, remove arm64 linux release (#13996) b5587 2025-06-04 13:15:54 +02:00
3ac67535c8 llama-graph : use ggml_repeat_4d (#13998) b5586 2025-06-04 10:11:26 +02:00
0b4be4c435 CUDA: fix FTZ in FA for Gemma 3 (#13991) b5585 2025-06-04 08:57:05 +02:00
e0e806f52e kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)
ggml-ci
b5584
2025-06-04 09:50:32 +03:00
7e00e60ef8 vulkan: fix warnings in perf logger querypool code (#13937) 2025-06-03 20:30:22 +02:00
ea1431b0fa docs : add "Quick start" section for new users (#13862)
* docs : add "Quick start" section for non-technical users

* rm flox

* Update README.md
2025-06-03 13:09:36 +02:00
71e74a3ac9 opencl: add backend_synchronize (#13939)
* This is not needed by the normal use where the result is read
  using `tensor_get`, but it allows perf mode of `test-backend-ops`
  to properly measure performance.
b5581
2025-06-02 16:54:58 -07:00
bfb1e012a0 OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes
b5580
2025-06-02 16:53:36 -07:00
3637576288 server : disable speculative decoding for SWA models (#13970)
* server : use swa-full fo draft context

ggml-ci

* server : disable speculative decoding for SWA models
b5579
2025-06-02 21:34:40 +03:00