llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-07-15 23:30:15 +00:00

Files

Jeff Bolz 6efcd65945 vulkan: optimize flash attention split_k_reduce (#14554)

* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).

2025-07-08 20:11:42 +02:00
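The reduction described in the commit message above can be sketched on the CPU as follows. This is a minimal illustration, not the Vulkan shader from the commit. It assumes each of the k_num splits leaves behind an unnormalized output accumulator plus its running max (M) and exponential sum (L). The names `SplitPartial` and `split_k_reduce_row` are hypothetical and exist only for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical container for what one KV split produces for a single query row.
struct SplitPartial {
    std::vector<float> o; // unnormalized output accumulator, length hsv (assumed layout)
    float m;              // max attention logit seen in this split
    float l;              // sum of exp(logit - m) over this split
};

// Combine k_num partial results into one normalized output row.
std::vector<float> split_k_reduce_row(const std::vector<SplitPartial>& parts, std::size_t hsv) {
    // Reduce the M values: the global max keeps every exponent <= 0.
    float m_max = -std::numeric_limits<float>::infinity();
    for (const SplitPartial& p : parts) {
        m_max = std::max(m_max, p.m);
    }

    // Reduce the L values, rescaling each split to the global max.
    float l_total = 0.0f;
    for (const SplitPartial& p : parts) {
        l_total += p.l * std::exp(p.m - m_max);
    }

    // Each output element is independent of the others, so a GPU version
    // can hand one element of the HSV dimension to each thread.
    std::vector<float> out(hsv, 0.0f);
    for (std::size_t d = 0; d < hsv; ++d) {
        float acc = 0.0f;
        for (const SplitPartial& p : parts) {
            acc += p.o[d] * std::exp(p.m - m_max);
        }
        out[d] = acc / l_total;
    }
    return out;
}
```

The first two loops run over k_num, which is why the commit spreads them across the whole workgroup when k_num is large. The last loop runs over HSV, which is the dimension the commit assigns one thread per element.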

cmake

ggml-cpu : rework weak alias on apple targets (#14146)

2025-06-16 13:54:15 +08:00

include

CUDA: add bilinear interpolation for upscale (#14563)

2025-07-08 10:11:18 +08:00

src

vulkan: optimize flash attention split_k_reduce (#14554)

2025-07-08 20:11:42 +02:00

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml : remove kompute backend (#14501)

2025-07-03 07:48:32 +03:00