CUDA: use mma PTX instructions for FlashAttention (#11583)

* CUDA: use mma PTX instructions for FlashAttention

* __shfl_sync workaround for movmatrix

* add __shfl_sync to HIP

Co-authored-by: Diego Devesa <slarengh@gmail.com>
Johannes Gäßler
2025-02-02 19:31:09 +01:00
committed by GitHub
parent 84ec8a58f7
commit 864a0b67a6
29 changed files with 2058 additions and 998 deletions

File diff suppressed because it is too large