Files
llama.cpp/src/llama-memory-recurrent.h

189 lines
5.6 KiB
C
Raw Normal View History

#pragma once
#include "llama-batch.h"
#include "llama-graph.h"
#include "llama-memory.h"
#include <set>
#include <vector>
//
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
// llama_memory_recurrent
//
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
// TODO: extract the cache state used for graph computation into llama_memory_recurrent_state_i
// see the implementation of llama_kv_cache_unified_state_i for an example how to do it
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
class llama_memory_recurrent : public llama_memory_i {
public:
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
// this callback is used to filter out layers that should not be included in the cache
using layer_filter_cb = std::function<bool(int32_t il)>;
llama_memory_recurrent(
const llama_model & model,
layer_filter_cb && filter,
ggml_type type_r,
ggml_type type_s,
bool offload,
uint32_t mem_size,
uint32_t n_seq_max);
~llama_memory_recurrent() = default;
//
// llama_memory_i
//
llama_memory_state_ptr init_batch(
const llama_batch & batch,
uint32_t n_ubatch,
bool embd_all) override;
llama_memory_state_ptr init_full() override;
llama_memory_state_ptr init_update(llama_context * lctx, bool optimize) override;
void clear(bool data) override;
bool seq_rm (llama_seq_id seq_id, llama_pos p0, llama_pos p1) override;
void seq_cp (llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1) override;
void seq_keep(llama_seq_id seq_id) override;
void seq_add (llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos shift) override;
void seq_div (llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d) override;
llama_pos seq_pos_min(llama_seq_id seq_id) const override;
llama_pos seq_pos_max(llama_seq_id seq_id) const override;
bool prepare(const std::vector<llama_ubatch> & ubatches);
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
// find a contiguous slot of memory cells and emplace the ubatch there
bool find_slot(const llama_ubatch & ubatch);
bool get_can_shift() const override;
// state write/load
void state_write(llama_io_write_i & io, llama_seq_id seq_id = -1) const override;
void state_read (llama_io_read_i & io, llama_seq_id seq_id = -1) override;
uint32_t head = 0; // the location where the batch will be placed in the cache (see find_slot())
uint32_t size = 0; // total number of cells, shared across all sequences
uint32_t used = 0; // used cells (i.e. at least one seq_id)
// computed before each graph build
uint32_t n = 0;
// first zero-ed state
int32_t rs_z = -1;
// TODO: optimize for recurrent state needs
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
struct mem_cell {
llama_pos pos = -1;
int32_t src = -1; // used to know where states should be copied from
int32_t src0 = -1; // like src, but only used when setting the inputs (allowing to copy once)
int32_t tail = -1;
std::set<llama_seq_id> seq_id;
bool has_seq_id(const llama_seq_id & id) const {
return seq_id.find(id) != seq_id.end();
}
bool is_empty() const {
return seq_id.empty();
}
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
bool is_same_seq(const mem_cell & other) const {
return seq_id == other.seq_id;
}
};
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
std::vector<mem_cell> cells;
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
// per layer
std::vector<ggml_tensor *> r_l;
std::vector<ggml_tensor *> s_l;
private:
//const llama_model & model;
const llama_hparams & hparams;
const uint32_t n_seq_max = 1;
std::vector<ggml_context_ptr> ctxs;
std::vector<ggml_backend_buffer_ptr> bufs;
size_t total_size() const;
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
size_t size_r_bytes() const;
size_t size_s_bytes() const;
void state_write_meta(llama_io_write_i & io, const std::vector<std::pair<uint32_t, uint32_t>> & cell_ranges, llama_seq_id seq_id = -1) const;
void state_write_data(llama_io_write_i & io, const std::vector<std::pair<uint32_t, uint32_t>> & cell_ranges) const;
bool state_read_meta(llama_io_read_i & io, uint32_t cell_count, llama_seq_id dest_seq_id = -1);
bool state_read_data(llama_io_read_i & io, uint32_t cell_count);
};
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
class llama_memory_recurrent_state : public llama_memory_state_i {
public:
// used for errors
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
llama_memory_recurrent_state(llama_memory_status status);
// used to create a full-cache state
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
llama_memory_recurrent_state(
llama_memory_recurrent * mem);
// used to create a state from a batch
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
llama_memory_recurrent_state(
llama_memory_recurrent * mem,
llama_sbatch sbatch,
std::vector<llama_ubatch> ubatches);
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
virtual ~llama_memory_recurrent_state();
//
// llama_memory_state_i
//
bool next() override;
bool apply() override;
std::vector<int64_t> & out_ids() override;
llama_memory_status get_status() const override;
const llama_ubatch & get_ubatch() const override;
//
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
// llama_memory_recurrent_state specific API
//
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
uint32_t get_n_rs() const;
uint32_t get_head() const;
int32_t get_rs_z() const;
uint32_t get_size() const;
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
ggml_tensor * get_r_l(int32_t il) const;
ggml_tensor * get_s_l(int32_t il) const;
int32_t s_copy(int i) const;
private:
const llama_memory_status status;
memory : Hybrid recurrent cache (#13979) * feat: Add llama_model_is_hybrid API call Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybird. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams). Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side constants for attention layer indices hparam Branch: GraniteFour * feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: rename *_is_hybrid -> *_is_hybrid_recurrent The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent." Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add layer filter to recurrent cache Branch: HybridCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer sizing everywhere in kv caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: First pass at llama_kv_cache_hybrid_recurrent This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Construct hybrid recurrent cache for hybrid recurrent models This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix wrong bool condition for split equal in hybrid cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix shift logic to defer to unified cache Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Support hybrid recurrent in llama-graph NOTE: I intentionally did not add support for s_mask since it will be going away soon Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix logic for initializing inputs and attn layers for hybrid caches Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update recurrent cache for changes to remove intermediate kv_cache interface Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix status for init_update sig for recurrent cache state Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing padding to n_ctx for hybrid cache construction Branch: GraniteFour Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update clear signature for data argument after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove errant virtual destructor leftover from previous impl attempt Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_s from unified cache No longer needed now that unified isn't also supporting recurrent https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069 Branch: HybridRecurrentCache * refactor: Remove layer index from n_embd_k/v_s Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove n_embd_k/v_gqa from recurrent cache This is no longer needed now that there are separate implementations https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Allow custom layer filters for hybrid recurrent This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove logits_all after rebase Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove llama_model_is_hybrid_Recurrent public API https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use llama_memory_state_ptr for child states in hybrid memory state Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738 This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are: - Rename class llm_graph_input_s_copy -> llm_graph_input_rs - Add a corresponding llm_graph_input_rs_hybrid_recurrent - Rename build_inp_s_copy -> build_rs_inp_recurrent - Add a corresponding build_rs_inp_hybrid_recurrent - Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent android-build AUTHORS bamba-9b-2.2T.gguf bamba-9b-2.2T.q4_k_m.gguf broken.log build build-rel build-xcframework.sh build.android build.android.bak ci cmake CMakeLists.txt CMakePresets.json CODEOWNERS common common.o CONTRIBUTING.md convert_hf_to_gguf_update.py convert_hf_to_gguf.py convert_llama_ggml_to_gguf.py convert_lora_to_gguf.py debug.log docs examples flake.lock flake.nix ggml ggml-alloc.o ggml-backend.o ggml-metal.o ggml-model-BF16.gguf ggml-model-Q4_K_M.gguf ggml-quants.o ggml.o gguf-py grammar-parser.o grammars include LICENSE licenses llama.log llama.o llamacpp_trace.log main.log Makefile media models mypy.ini pocs poetry.lock prompts pyproject.toml pyrightconfig.json q4_k_m_boot.log q8_0_boot.log quant.log quant2.log README.md requirements requirements.txt sampling.o scripts SECURITY.md src test-grammar-output.tmp test-json-schema-input.tmp tests tools vendor working.log as the first input This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix resize vs reserve and skip null tensors in size computation https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788 Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: @younesbelkada * fix: Fix initialization of child states Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a common build_recurrent_state method that is cache-agnostic This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * recurrent : rework graph inputs + add TODOs ggml-ci * refactor: Make status and child states const in hybrid and iswa Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment. Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor!: Rename all k/v related values for recurrent/hybrid to r/s Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states). Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refacor: _recurrent -> _recr for brevity It just _happens_ to have the same number of letters as _attn! Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for ref Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: recurrent_layer() -> is_recurrent() Branch: HybridRecurrentCache Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Fix spacing for size_s_bytes declaration Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-19 00:08:14 -05:00
llama_memory_recurrent * mem;
llama_sbatch sbatch;
size_t i_next = 0;
std::vector<llama_ubatch> ubatches;
//
// data needed for building the compute graph for the current ubatch:
// TODO: extract all the state like `head` and `n` here
//
const bool is_full = false;
};