CUDA: quantized KV support for FA vec (#7527)

* CUDA: quantized KV support for FA vec
* try CI fix
* fix commented-out kernel variants
* add q8_0 q4_0 tests
* fix nwarps > batch size
* split fattn compile via extern templates
* fix flake8
* fix metal tests
* fix cmake
* make generate_cu_files.py executable
* add autogenerated .cu files
* fix AMD
* error if type_v != FP16 and not flash_attn
* remove obsolete code
2025-07-30 14:13:57 -04:00 · 2024-06-01 08:44:14 +02:00
parent a323ec60af
commit 9b596417af
110 changed files with 2697 additions and 1200 deletions
--- a/ggml-cuda/fattn-tile-f32.cu
+++ b/ggml-cuda/fattn-tile-f32.cu
@@ -36,6 +36,9 @@ static __global__ void flash_attn_tile_ext_f32(
        const int nb11,
        const int nb12,
        const int nb13,
+        const int nb21,
+        const int nb22,
+        const int nb23,
        const int ne0,
        const int ne1,
        const int ne2,