Gabe Goodhart
d8c929ff5d
feat: Allow custom layer filters for hybrid recurrent
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:19 -06:00
Gabe Goodhart
9c1a604af8
fix: Update clear signature for data argument after rebase
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
911e694476
fix: Fix status for init_update sig for recurrent cache state
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
d3699366e6
fix: Update recurrent cache for changes to remove intermediate kv_cache interface
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
cf03d4ae5c
fix: Fix shift logic to defer to unified cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
6c6ec0003a
fix: Fix wrong bool condition for split equal in hybrid cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
c71eaa37a0
feat: First pass at llama_kv_cache_hybrid_recurrent
This follows the pattern in iswa, where the two child caches are held
explicitly. It supports the case where a model needs one attention cache
and one recurrent cache, with each layer using exactly one of the two.
This is a rewrite of the more generic approach in the original hybrid cache
PR: https://github.com/ggml-org/llama.cpp/pull/13276
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00