llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2025-06-26 19:55:04 +00:00

Files

Jeff Bolz dc1d2adfc0 vulkan: scalar flash attention implementation (#13324)

* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA

2025-05-10 08:07:07 +02:00

actions

ci : move release workflow to a separate file (#13362)

2025-05-08 13:15:28 +02:00

ISSUE_TEMPLATE

repo : update links to new url (#11886)

2025-02-15 16:40:57 +02:00

workflows

vulkan: scalar flash attention implementation (#13324)

2025-05-10 08:07:07 +02:00

labeler.yml

llama : move end-user examples to tools directory (#13249)

2025-05-02 20:27:13 +02:00

pull_request_template.md

repo : update links to new url (#11886)

2025-02-15 16:40:57 +02:00