Commit Graph

1075 Commits

chen fan
14c28dfc50 CANN: weight format to NZ for Ascend310P3 (#14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-23 11:58:00 +08:00
Aman Gupta
8c988fa41d CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
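
For reference, "fused rms norm" here means folding the RMS normalization and the elementwise multiply by the norm weights into one kernel pass. A minimal scalar C++ sketch of the fused semantics (illustrative names, not the actual CUDA kernel):

```cpp
#include <cmath>
#include <cstddef>

// Fused rms_norm + mul over one row:
//   out[i] = (x[i] / sqrt(mean(x^2) + eps)) * w[i]
// A fused kernel computes this in a single pass instead of first
// materializing the normalized row and then multiplying it by w.
void rms_norm_mul_row(const float * x, const float * w, float * out,
                      size_t n, float eps) {
    float sumsq = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sumsq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sumsq / (float) n + eps);
    for (size_t i = 0; i < n; ++i) {
        out[i] = x[i] * scale * w[i];
    }
}
```
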
Jeff Bolz
84712b6043 vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817) 2025-07-22 17:35:21 +02:00
Sigbjørn Skjæret
e28c0b80c2 cuda : implement bf16 cpy ops and enable bf16 cont (#14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-22 12:33:10 +02:00
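
bf16 is the top 16 bits of an fp32 value, which is what makes these copy ops cheap. A conversion sketch with round-to-nearest-even (an assumed formulation; NaN handling omitted):

```cpp
#include <cstdint>
#include <cstring>

// fp32 -> bf16: keep the high 16 bits, rounding to nearest-even.
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7fff + ((bits >> 16) & 1); // round-to-nearest-even
    return (uint16_t) (bits >> 16);
}

// bf16 -> fp32: shift back into the high half; exact, no rounding.
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```
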
lhez
8e6f8bc875 opencl: remove unreachable return (#14806) 2025-07-22 08:53:30 +02:00
R0CKSTAR
48b86c4fdb cuda: remove linking to cublasLt (#14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-22 07:45:26 +08:00
Sigbjørn Skjæret
38d3af1b73 opencl: fix im2col when KW!=KH (#14803) 2025-07-21 13:55:10 -07:00
rmatif
6c9ee3b17e opencl: add conv2d kernel (#14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fix

* handle f16 input and f16 kernel, more optimizations

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-21 10:03:19 -07:00
Romain Biessy
cd465d823c sycl: Fix im2col (#14797) 2025-07-21 18:39:29 +02:00
Charles Xu
922042601b kleidiai: add support for get_rows (#14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-21 16:49:52 +03:00
Jeff Bolz
c2e058f1b4 vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
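
The decomposition in that message recovers (ow, ky, kx) from a flat thread id; with KW != KH the middle term must be bounded by KH, not KW. A sketch (hypothetical helper, matching the quoted layout):

```cpp
// tid = ow + ky*OW + kx*OW*KH, so "ksize" must be OW*KH*KW.
void decompose_tid(int tid, int OW, int KH,
                   int * ow, int * ky, int * kx) {
    *ow = tid % OW;
    *ky = (tid / OW) % KH;
    *kx = tid / (OW * KH);
}
```
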
Ervin Áron Tasnádi
a979ca22db ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
  with gemm (no need for im2col)

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
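
"Directly with gemm (no need for im2col)" means treating the convolution as an implicit matrix multiply: the im2col matrix is indexed on the fly instead of being written to memory. A sketch of the usual dimension mapping (a common formulation, assumed rather than taken from the shader):

```cpp
// Conv2D as an implicit GEMM:
//   A: [Cout x Cin*KH*KW]   -- kernel weights
//   B: [Cin*KH*KW x OH*OW]  -- im2col view of the input, never materialized
//   C: [Cout x OH*OW]       -- output feature map
struct Conv2dGemmDims { int M, N, K; };

Conv2dGemmDims conv2d_gemm_dims(int Cin, int Cout, int KH, int KW,
                                int OH, int OW) {
    return { /*M=*/ Cout, /*N=*/ OH * OW, /*K=*/ Cin * KH * KW };
}
```
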
Peter0x44
d4b91ea7b2 vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (#14707) 2025-07-19 17:58:03 +02:00
0cc4m
83f5872404 Vulkan: Fix fprintf format-security warning (#14770) 2025-07-19 17:47:53 +02:00
Georgi Gerganov
bf9087f59a metal : fuse add, mul + add tests (#14596)
ggml-ci
2025-07-18 20:37:26 +03:00
Oliver Simons
021cc28bef cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (#14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses matrix-matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch
size of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-18 04:35:32 -07:00
Aman Gupta
f9a31eea06 CUDA: set_rows + cpy.cu refactor (#14712) 2025-07-18 14:54:18 +08:00
Neo Zhang Jianyu
349ea79fce use max work group size for device to replace the magic number (#14732) 2025-07-18 10:23:14 +08:00
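
Replacing the magic number means asking the device for its limit at runtime; SYCL exposes this as device info. A minimal sketch (not the PR's code):

```cpp
#include <sycl/sycl.hpp>

// Query the maximum work-group size instead of hard-coding a constant.
size_t max_wg_size(const sycl::queue & q) {
    return q.get_device()
            .get_info<sycl::info::device::max_work_group_size>();
}
```
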
Reese Levine
21c021745d ggml: Add initial WebGPU backend (#14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-16 18:18:51 +03:00
Georgi Gerganov
225e7a1438 llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Georgi Gerganov
64978340b0 ggml : add asserts (#14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-16 14:43:32 +03:00
Jeff Bolz
ba1ceb3456 vulkan: fix noncontig check for mat_mul_id splitting (#14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-15 21:51:09 +02:00
Jeff Bolz
10a0351a97 vulkan: add RTE variants for glu/add/sub/mul/div (#14653) 2025-07-15 21:32:11 +02:00
R0CKSTAR
cbc68be51d cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-15 15:28:53 +08:00
Anton Mitkov
bdca38376f sycl: Hotfix for non dnnl codepath (#14677) 2025-07-14 18:12:42 +01:00
shalinib-ibm
55c509daf5 ggml : refactor llamafile_sgemm PPC code (#14673)
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions

2–7% improvement in Q8 models
15–50% improvement in Q4 models

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-14 16:16:42 +03:00
Akarshan Biswas
0f4c6ec0f1 SYCL: use 1D kernel for set_rows (#14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-14 10:37:55 +01:00
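
ceil_div is the standard helper for sizing a 1D launch so that every element is covered. A sketch under assumed names:

```cpp
#include <cstddef>

// Number of work-groups of size b needed to cover a elements.
constexpr size_t ceil_div(size_t a, size_t b) {
    return (a + b - 1) / b;
}
// A 1D kernel over n elements with work-group size wg then launches
// ceil_div(n, wg) groups, and each work-item guards against i >= n.
```
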
Anton Mitkov
65a3ebb0aa sycl: Batched mulmat rework for oneDNN dispatch (#14617) 2025-07-14 10:37:35 +01:00
Sigbjørn Skjæret
923e3ea2e3 cuda : add set rows for bf16 (#14664) 2025-07-13 15:01:24 +02:00
Yavor Ivanov
e743cddb60 cuda : add ELU support (#14657) 2025-07-13 11:33:16 +02:00
Georgi Gerganov
05fec5bd29 ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
2025-07-13 10:36:33 +03:00
Yavor Ivanov
dcf7f2ea3c metal : Add missing unary ops Metal support (#14660) 2025-07-13 08:38:13 +03:00
Aman Gupta
7de5c7cab6 CUDA: add set rows for f32 and f16 (#14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-12 16:31:38 +03:00
Georgi Gerganov
3120413ccd vulkan : remove unused vars (#0)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly
74bb294591 vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
2025-07-12 14:25:44 +03:00
Acly
3e303b1107 vulkan : implement ggml_roll (ggml/1290)
ggml-ci
2025-07-12 14:25:44 +03:00
Jeff Bolz
b3ad3a0191 vulkan: support SET_ROWS (#14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 12:12:26 +02:00
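
"Use fastmod" refers to replacing a runtime modulo with precomputed multiplies. One well-known formulation is Lemire's, sketched below in host C++ (an assumed technique; the shader's exact variant may differ):

```cpp
#include <cstdint>

// Lemire's fastmod: precompute M once per divisor d, then each n % d
// costs two multiplies instead of a hardware division.
struct FastMod {
    uint64_t M;
    uint32_t d;
    explicit FastMod(uint32_t d_) : M(UINT64_MAX / d_ + 1), d(d_) {}
    uint32_t mod(uint32_t n) const {
        const uint64_t lowbits = M * n;
        return (uint32_t) (((unsigned __int128) lowbits * d) >> 64);
    }
};
```
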
Jeff Bolz
98197e5c98 vulkan: optimizations for deepseek prompt processing (#14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 11:51:58 +02:00
Tarek Dakhran
f5e96b368f model : support LiquidAI LFM2 hybrid family (#14620)
**Important**
LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released.
To convert to GGUF, install transformers from source:
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-11 20:27:01 +02:00
Slobodan Josic
756aa1020a HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634) 2025-07-11 18:55:00 +02:00
rmatif
6bdda13981 opencl: add tiled mul_mat_f16_f32 (#14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-10 14:58:12 -07:00
lhez
0b8855775c opencl: add set_rows for f16 and f32 (#14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-10 11:48:52 -07:00
Akarshan Biswas
704bb7a71c SYCL: Initial set_rows kernel implementation (#14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-10 09:29:38 +01:00
compilade
a57d1bcb3c cuda : support Falcon-H1 state size for SSM_SCAN (#14602) 2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen
98bab638fb ggml : add ggml_scale_bias (#14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
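
The op extends ggml_scale from y = x*s to y = x*s + b, which is also what the ggml_vec_mad1_f32 helper mentioned above evaluates per element (Apple's vDSP_vsmsa is the equivalent vector routine). A scalar reference sketch:

```cpp
#include <cstddef>

// Scale-with-bias: y[i] = x[i]*s + b. The SIMD paths compute the same thing.
void scale_bias_ref(size_t n, float * y, const float * x, float s, float b) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * s + b;
    }
}
```
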
Miaoqian Lin
26a48ad699 ggml : prevent integer overflow in gguf tensor size calculation (#14595) 2025-07-09 14:33:53 +02:00
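
The usual fix for this class of bug is checked multiplication while accumulating the dimension product. An illustrative sketch (not the PR's exact code):

```cpp
#include <cstdint>

// Accumulate a tensor's element count with explicit overflow checks
// instead of letting the product silently wrap. Relies on the
// GCC/Clang __builtin_mul_overflow intrinsic.
static bool tensor_nelements(const uint64_t * dims, int n_dims, uint64_t * out) {
    uint64_t total = 1;
    for (int i = 0; i < n_dims; ++i) {
        if (__builtin_mul_overflow(total, dims[i], &total)) {
            return false; // overflow: reject the tensor
        }
    }
    *out = total;
    return true;
}
```
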
Jeff Bolz
6efcd65945 vulkan: optimize flash attention split_k_reduce (#14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-08 20:11:42 +02:00
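
split_k shards the KV dimension across workgroups, so each partial result carries the running max m and running sum l of the online softmax; the reduce step merges them. A scalar sketch of combining two partials (the standard flash-attention merge, assumed names):

```cpp
#include <cmath>
#include <cstddef>

// Merge two partials over disjoint KV ranges. o1/o2 are unnormalized
// output accumulators of length hsv; divide by the final l at the end.
void fa_merge(float m1, float l1, const float * o1,
              float m2, float l2, const float * o2,
              size_t hsv, float * o, float * m, float * l) {
    const float mx = std::fmax(m1, m2);
    const float c1 = std::exp(m1 - mx);
    const float c2 = std::exp(m2 - mx);
    *m = mx;
    *l = l1 * c1 + l2 * c2;
    for (size_t i = 0; i < hsv; ++i) {
        o[i] = o1[i] * c1 + o2[i] * c2; // rescale to the shared max, then add
    }
}
```
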
Jeff Bolz
b8eeb8741d vulkan : fix rope with partial rotation and non-cont src (#14582) 2025-07-08 15:21:21 +02:00
Georgi Gerganov
4d0dcd4a06 cuda : fix rope with partial rotation and non-cont src (#14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-08 10:15:21 +03:00
Aman Gupta
75c91de6e9 CUDA: add bilinear interpolation for upscale (#14563) 2025-07-08 10:11:18 +08:00
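
Bilinear interpolation blends the four input pixels nearest to each output sample. A scalar reference of the sampling step (one common alignment convention, assumed):

```cpp
#include <algorithm>
#include <cmath>

// Sample a W x H plane at fractional (x, y), blending the four
// surrounding pixels by their fractional distances.
float bilinear_sample(const float * src, int W, int H, float x, float y) {
    const int   x0 = std::clamp((int) std::floor(x), 0, W - 1);
    const int   y0 = std::clamp((int) std::floor(y), 0, H - 1);
    const int   x1 = std::min(x0 + 1, W - 1);
    const int   y1 = std::min(y0 + 1, H - 1);
    const float fx = x - (float) x0;
    const float fy = y - (float) y0;
    const float top = src[y0*W + x0] * (1 - fx) + src[y0*W + x1] * fx;
    const float bot = src[y1*W + x0] * (1 - fx) + src[y1*W + x1] * fx;
    return top * (1 - fy) + bot * fy;
}
```
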