Gabe Goodhart
d8c929ff5d
feat: Allow custom layer filters for hybrid recurrent
This should help support architectures like Falcon H1 where there is
overlap between layers that need attention and recurrent caches.
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:19 -06:00
Gabe Goodhart
9c1a604af8
fix: Update clear signature for data argument after rebase
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
911e694476
fix: Fix status for init_update sig for recurrent cache state
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
d3699366e6
fix: Update recurrent cache for changes to remove intermediate kv_cache interface
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
cf03d4ae5c
fix: Fix shift logic to defer to unified cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
6c6ec0003a
fix: Fix wrong bool condition for split equal in hybrid cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00
Gabe Goodhart
c71eaa37a0
feat: First pass at llama_kv_cache_hybrid_recurrent
This follows the pattern in iswa, where the two child caches are held
explicitly. It supports the case where a model needs one attention cache
and one recurrent cache, with each layer using exactly one of the two.
This is a rewrite of the more generic approach in the original hybrid cache
PR: https://github.com/ggml-org/llama.cpp/pull/13276
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-06-17 14:54:18 -06:00