Commit Graph

5752 Commits

ab46d11de5 Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to cut per-element function-call overhead inside kernels.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

Together, these changes reduce per-operation overhead and improve the overall efficiency of element-wise operations on SYCL devices.
2025-06-22 19:21:19 +05:30
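As a rough illustration of the op_xxx helper pattern described above, a minimal sketch under assumed names and signatures (not the commit's actual code):

```cpp
// Sketch only: op_* helpers and the kernel shape are assumptions based on
// the commit message, not the real ggml-sycl sources.
#ifndef __dpct_inline__
#define __dpct_inline__ __inline__ __attribute__((always_inline))
#endif

#include <sycl/sycl.hpp>

template <typename T>
__dpct_inline__ T op_sgn(T x) {
    return x > T(0) ? T(1) : (x < T(0) ? T(-1) : T(0));
}

template <typename T>
__dpct_inline__ T op_silu(T x) {
    return x / (T(1) + sycl::exp(-x));
}

// One templated kernel body serves every unary op; the functor is inlined,
// so adding an op does not add per-element call overhead.
template <typename T, typename F>
void unary_op_kernel(const T *src, T *dst, size_t n, F f, sycl::nd_item<1> it) {
    const size_t i = it.get_global_id(0);
    if (i < n) dst[i] = f(src[i]);
}
```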
a234e09f41 GGML: increase OP count in assertion 2025-06-22 10:37:26 +05:30
35dacd1a93 ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_wise ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
2025-06-22 10:37:26 +05:30
a9aedf46b4 SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate 2025-06-22 10:37:26 +05:30
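For reference, a hedged sketch of what the split and fused GLU variants above compute; gelu_tanh mirrors the tanh approximation mentioned for GEGLU, and the function names are illustrative rather than ggml's API. The "swapped variants" entry below simply exchanges the up/gate roles.

```cpp
#include <cmath>
#include <cstddef>

static float gelu_tanh(float x) {
    const float c = 0.797884560802865f; // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}

// split variant: up and gate come from two tensors of equal shape
void geglu_split(const float *up, const float *gate, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = gelu_tanh(gate[i]) * up[i];
}

// fused variant: a single row holds [up | gate], each half of length n
void geglu_fused(const float *row, float *dst, size_t n) {
    const float *up = row, *gate = row + n;
    for (size_t i = 0; i < n; ++i) dst[i] = gelu_tanh(gate[i]) * up[i];
}
```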
34d1aedafb Vulkan: Add GLU ops and shaders 2025-06-22 10:37:25 +05:30
d5934297ef update comment [no ci]
ggml-ci
2025-06-22 10:37:25 +05:30
0b2703fc57 implement swapped variants (cpu/cuda) 2025-06-22 10:37:25 +05:30
f8705a2399 64bit multiplication [no ci] 2025-06-22 10:37:25 +05:30
70e8b48e6a more constraints and use 64bit ints
ggml-ci
2025-06-22 10:37:25 +05:30
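The 64-bit changes above guard against index overflow; a minimal sketch of why it matters, assuming a large-tensor index computation:

```cpp
// A 32-bit index product overflows once a tensor exceeds ~2^31 elements,
// hence casting to int64_t BEFORE the multiply. Sketch, not the commit's code.
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t row = 300000, ne0 = 8192;     // plausible large tensor
    const int64_t idx = row * ne0;              // 64-bit multiply, no overflow
    // The same product in 32-bit math exceeds INT_MAX (signed-overflow UB).
    std::printf("idx = %lld, INT_MAX = %d\n", (long long) idx, INT_MAX);
}
```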
cfa9c7a47a add CUDA_GLU_BLOCK_SIZE [no ci] 2025-06-22 10:37:25 +05:30
d9ddeb9dfd metal : add glu kernels
ggml-ci
2025-06-22 10:37:24 +05:30
a341aa3c2b refactor into GGML_GLU_OP 2025-06-22 10:37:24 +05:30
f8c20809de tighten constraints again 2025-06-22 10:37:24 +05:30
a1a7b6dfa9 implement unary REGLU/GEGLU/SWIGLU cuda ops 2025-06-22 10:37:24 +05:30
bb2fda70ae special case gated ops 2025-06-22 10:37:24 +05:30
21c4963bd3 fix ggml_vec_geglu_f16 2025-06-22 10:37:24 +05:30
56c7993171 duplicate shape of source 2025-06-22 10:37:23 +05:30
5a490f07a2 relax constraints 2025-06-22 10:37:23 +05:30
76c9bc1731 implement unary REGLU/GEGLU/SWIGLU cpu ops 2025-06-22 10:37:23 +05:30
aa064b2eb7 CUDA: add mean operation (#14313)
* CUDA: add mean operation

* add back sum_rows_f32_cuda

* Review: early exit if col!=0
b5733
2025-06-22 12:39:54 +08:00
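As plain C++, the row-wise mean that the new CUDA op parallelizes (a sketch of the reduction, not the kernel itself):

```cpp
#include <cstddef>

// For each row, sum the columns and divide by the row length.
void mean_rows(const float *src, float *dst, int nrows, int ncols) {
    for (int r = 0; r < nrows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < ncols; ++c) sum += src[(size_t) r * ncols + c];
        dst[r] = sum / (float) ncols;
    }
}
```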
aa0ef5c578 gguf-py : fix Qwen3-Embedding eos token (#14314) 2025-06-21 18:12:05 +02:00
bb16041cae Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
* Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1, compute pipelines are labeled.

* remove #ifdef for debug utils and add queue marker.
b5731
2025-06-21 08:17:12 +02:00
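A hedged sketch of the standard VK_EXT_debug_utils labeling pattern this PR applies; the handles and setup are assumed, and this mirrors the usual Vulkan idiom rather than the PR's exact code:

```cpp
#include <vulkan/vulkan.h>

// The extension must be enabled at instance creation for this to resolve.
void label_pipeline(VkInstance inst, VkDevice dev, VkPipeline pipe, const char *name) {
    auto set_name = (PFN_vkSetDebugUtilsObjectNameEXT)
        vkGetInstanceProcAddr(inst, "vkSetDebugUtilsObjectNameEXT");
    if (set_name == nullptr) return; // extension not available

    VkDebugUtilsObjectNameInfoEXT info = {};
    info.sType        = VK_STRUCTURE_TYPE_DEBUG_UTILS_OBJECT_NAME_INFO_EXT;
    info.objectType   = VK_OBJECT_TYPE_PIPELINE;
    info.objectHandle = (uint64_t) pipe;
    info.pObjectName  = name; // the label shown by debuggers and profilers
    set_name(dev, &info);
}
```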
58cba76a9a gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312) 2025-06-21 07:33:21 +02:00
67ae5312e2 metal : fix thread-safety (#14300)
ggml-ci
b5729
2025-06-21 08:04:18 +03:00
692e3cdd0a memory : rename interface to llama_memory_context_i (#14296)
* memory : rename interface to llama_memory_context_i

ggml-ci

* cont : fix comments

* cont : use "mctx" for referencing a memory context

ggml-ci
b5728
2025-06-21 08:03:46 +03:00
b23fa0b3f4 convert : fix Llama 4 conversion (#14311) 2025-06-21 06:32:01 +02:00
06cbedfca1 sync : ggml
ggml-ci
b5726
2025-06-20 21:02:47 +03:00
b7147673f2 Add ggml_roll (ggml/1274)
* ggml : add ggml_roll

* use set/get_op_params & std::min
2025-06-20 21:02:47 +03:00
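In one dimension, a roll is a shift with wrap-around; a small illustrative sketch (ggml_roll generalizes this across tensor dimensions, and this is not the ggml implementation):

```cpp
#include <vector>

std::vector<float> roll_1d(const std::vector<float> &src, int k) {
    const int n = (int) src.size();
    std::vector<float> dst(n);
    if (n == 0) return dst;
    for (int i = 0; i < n; ++i)
        dst[((i + k) % n + n) % n] = src[i]; // handles negative k too
    return dst;
}
```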
d860dd99a4 docs : fix the link to llama.h (#14293) 2025-06-20 19:43:35 +02:00
c959f462a0 CUDA: add conv_2d_transpose (#14287)
* CUDA: add conv_2d_transpose

* remove direct include of cuda_fp16

* Review: add brackets for readability, remove ggml_set_param and add asserts
b5723
2025-06-20 22:48:24 +08:00
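For orientation, the standard output-size relation for a transposed 2-D convolution; the parameter names here are illustrative, not ggml's exact signature:

```cpp
// out = (in - 1) * stride - 2 * pad + kernel, per spatial dimension
int conv2d_transpose_out_size(int in, int kernel, int stride, int pad) {
    return (in - 1) * stride - 2 * pad + kernel;
}
```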
22015b2092 lint : remove trailing whitespace (#14304) b5722 2025-06-20 16:37:44 +02:00
dd6e6d0b6a vocab : prevent tokenizer overflow (#14301)
* vocab : prevent stack overflow in tokenize

* vocab : return error instead of aborting on oversized token count

* vocab : INT32_MIN from llama_tokenize on overflow
b5721
2025-06-20 07:13:06 -07:00
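A minimal sketch of the guard described above, assuming a hypothetical helper that maps an oversized count to the INT32_MIN sentinel mentioned in the commit message:

```cpp
#include <cstddef>
#include <cstdint>

// Return the token count if it fits in int32, else INT32_MIN as an error,
// instead of aborting.
int32_t checked_token_count(size_t n_tokens) {
    if (n_tokens > (size_t) INT32_MAX) return INT32_MIN; // overflow -> error
    return (int32_t) n_tokens;
}
```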
8308f98c7f sycl: add usage of enqueue_functions extension (#14244)
* Add header and namespace to use enqueue_functions extension

* Convert submit and parallel_for to use new extension in convert.cpp

* Convert submit and parallel_for to use extension in ggml-sycl.cpp

* Convert submit and parallel_for to use extension in gla.cpp

* Convert submit and parallel_for in mmq.cpp

* Convert submit and parallel_for in mmvq.cpp

* Convert submit and parallel_for in remaining files

* Convert all simple parallel_for to nd_launch from the enqueue_functions extension

* Wrapping extension in general function

Create a general function that uses the enqueue_functions extension when it is enabled in the compiler, and otherwise calls the generic SYCL function to launch kernels.

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b5720
2025-06-20 15:07:21 +02:00
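A hedged sketch of such a wrapper, assuming the extension's feature macro and its nd_launch entry point; the wrapper name is ours, not the PR's:

```cpp
#include <sycl/sycl.hpp>

template <typename Kernel>
void launch_nd(sycl::queue &q, sycl::nd_range<1> range, Kernel k) {
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS
    // extension path: free-function kernel launch
    sycl::ext::oneapi::experimental::nd_launch(q, range, k);
#else
    // fallback: classic member-function launch
    q.parallel_for(range, k);
#endif
}
```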
6369be0735 Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
* Add PowerPC feature detection and scoring

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC

* ggml-cpu: Delay some initializations until function is called

When using GGML_BACKEND_DL=ON, these initializations might use
instructions that are not supported by the current CPU.

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
b5719
2025-06-20 14:17:32 +02:00
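A minimal sketch of the delayed-initialization idea, with hypothetical types: a function-local static runs on first use rather than at library load, so probing variants under GGML_BACKEND_DL=ON never executes unsupported instructions:

```cpp
#include <array>

struct LookupTable { std::array<float, 256> v; };

static LookupTable build_table() {
    LookupTable t{};  // may use VSX/MMA instructions internally
    return t;
}

const LookupTable &get_table() {
    static LookupTable t = build_table(); // constructed on first call,
    return t;                             // not when the .so is dlopen'd
}
```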
88fc854b4b llama : improve sep token handling (#14272) b5718 2025-06-20 14:04:09 +02:00
e28c1b93fd cuda : synchronize graph capture and cublas handle destruction (#14288)
Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread.
b5717
2025-06-20 13:57:36 +02:00
d27b3ca175 ggml : fix repack work size for mul_mat_id (#14292)
ggml-ci
b5716
2025-06-20 11:19:15 +03:00
9230dbe2c7 ggml: Update KleidiAI to v1.9.0 (#14277) b5715 2025-06-20 10:51:01 +03:00
812939a9e9 model : more uniform output id handling (#14275)
* model : more uniform output id handling

ggml-ci

* cont : revert n_outputs < n_tokens optimization

ggml-ci

* cont : fix out_ids initialization

ggml-ci
b5714
2025-06-20 10:50:27 +03:00
4c9fdfbe15 ubatch : new splitting logic (#14217)
ggml-ci
b5713
2025-06-20 10:14:14 +03:00
9eaa51e7f0 CUDA: add conv_2d_dw (#14265)
* CUDA: add conv_2d_dw

* better naming

* simplify using template

* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
b5712
2025-06-20 09:50:24 +08:00
8f71d0f3e8 ggml-cpu : remove unnecessary arm feature detection (#14281)
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
b5711
2025-06-19 21:24:14 +02:00
381174bbda gguf-py : make sentencepiece optional (#14200)
* Make sentencepiece optional

* Bump to 0.18.0

* Bump patch instead of minor

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: compilade <git@compilade.net>
gguf-v0.17.1
2025-06-19 15:56:12 +02:00
d67341dc18 server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 <27946957+aa956@users.noreply.github.com>
b5709
2025-06-19 16:01:03 +03:00
456af35eb7 build : suppress gcc15 compile warnings (#14261)
* Change _contains_any() substrs to std::string_view and fix the find comparison logic.
b5708
2025-06-19 14:49:48 +02:00
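A hedged sketch of the described change; the name mirrors the message, the body is assumed:

```cpp
#include <initializer_list>
#include <string_view>

// string_view parameters avoid temporary std::string copies; comparing
// find() against npos is the corrected comparison logic.
static bool contains_any(std::string_view s,
                         std::initializer_list<std::string_view> subs) {
    for (std::string_view sub : subs)
        if (s.find(sub) != std::string_view::npos) return true;
    return false;
}
```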
600e3e9b50 sycl: Cleanup codepaths in Get Rows in sycl backend (#14215)
Addresses unused reorder path
b5707
2025-06-19 11:40:21 +01:00
fffcce535e llama-bench : add --no-warmup flag (#14224) (#14270)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

- Add no_warmup boolean field to cmd_params struct

- Add --no-warmup command-line argument parsing

- Add help text documentation for the new flag

- Wrap existing warmup logic in conditional check

- Maintain full backward compatibility (warmup enabled by default)

Addresses #14224
b5706
2025-06-19 12:24:12 +02:00
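A minimal sketch of the conditional warmup wrap; the field name matches the message, the surrounding structure is assumed:

```cpp
struct cmd_params {
    bool no_warmup = false; // warmup stays enabled by default
    // ... existing fields ...
};

void run_bench(const cmd_params &p) {
    if (!p.no_warmup) {
        // existing warmup run(s) remain here unchanged
    }
    // timed benchmark runs follow
}
```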
5fc7856815 convert : fix remote option in Windows (#14100) 2025-06-19 12:21:40 +02:00
faed5a5f5d llamafile : support s390x SIMD instruction set (#14273) b5704 2025-06-19 11:48:54 +02:00
10bb545c5b Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249) b5703 2025-06-19 09:15:42 +02:00