* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).