Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models ( #14753 )
...
* graph : avoid huge warm-up graphs for MoE models
ggml-ci
* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly ( #14629 )
...
ggml-ci
2025-07-18 08:29:28 +03:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs ( #14482 )
...
* llama : reuse compute graphs
ggml-ci
* llama-bench : add graph reuse parameter
ggml-ci
* cont : remove the parameter and the sched resets
ggml-ci
* graph : rename update() to can_reuse()
ggml-ci
* params : remove is_same()
ggml-ci
* graph : set res->params in llm_graph_context constructor
ggml-ci
* graph : avoid set_max_nodes in llm_graph_result
ggml-ci
* kv-cache : reuse llama_context's graph result instance
ggml-ci
* context : reset the previous graph result upon memory updates
ggml-ci
* batch : llama_ubatch now carries its data instead of pointing to balloc
ggml-ci
* merge : fix build
ggml-ci
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
ggml-ci
2025-07-17 19:08:33 +03:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de >
2025-07-16 16:35:42 +03:00
Aman Gupta
9c9e4fc635
llama-context: add ability to get logits ( #14672 )
2025-07-14 21:01:41 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Georgi Gerganov
692e3cdd0a
memory : rename interface to llama_memory_context_i ( #14296 )
...
* memory : rename interface to llama_memory_context_i
ggml-ci
* cont : fix comments
* cont : use "mctx" for referencing a memory context
ggml-ci
2025-06-21 08:03:46 +03:00
Georgi Gerganov
4c9fdfbe15
ubatch : new splitting logic ( #14217 )
...
ggml-ci
2025-06-20 10:14:14 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
Georgi Gerganov
c311ac664d
cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ ( #14188 )
...
ggml-ci
2025-06-15 10:08:58 +03:00
Georgi Gerganov
b9912ac570
batch : auto-gen positions + verify multi-sequence input ( #14177 )
...
* batch : verify multi-sequence input batches
ggml-ci
* cont : auto-gen positions + verify multi-seq input
ggml-ci
* cont : first print debug info, then perform validation
ggml-ci
* cont : fix position auto-gen + add comments
ggml-ci
2025-06-15 09:18:37 +03:00
Georgi Gerganov
60c666347b
batch : rework llama_batch_allocr ( #14153 )
...
* batch : rework llama_batch_allocr
ggml-ci
* cont : move validation inside class
ggml-ci
* cont : move output counting to class
ggml-ci
* cont : minor
ggml-ci
* batch : add TODOs
ggml-ci
2025-06-13 13:47:55 +03:00
Georgi Gerganov
f6e1a7aa87
context : simplify output counting logic during decode ( #14142 )
...
* batch : remove logits_all flag
ggml-ci
* context : simplify output counting logic during decode
ggml-ci
* cont : fix comments
2025-06-12 11:50:01 +03:00
Georgi Gerganov
c3ee46fab4
batch : remove logits_all flag ( #14141 )
...
ggml-ci
2025-06-12 11:49:26 +03:00
Georgi Gerganov
9596506965
kv-cache : fix split_equal handling in unified implementation ( #14130 )
...
ggml-ci
2025-06-12 10:02:15 +03:00
compilade
a20b2b05bc
context : round n_tokens to next multiple of n_seqs when reserving ( #14140 )
...
This fixes RWKV inference which otherwise failed
when the worst case ubatch.n_seq_tokens rounded to 0.
2025-06-12 02:56:04 -04:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov
487a5e0401
context : fix SWA-related warning for multiple sequences ( #14045 )
2025-06-06 13:29:18 +03:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Georgi Gerganov
9e31bec4fd
context : fix pos_min initialization upon error decode ( #14008 )
...
ggml-ci
2025-06-05 09:06:29 +03:00
Georgi Gerganov
3e63a58ef7
kv-cache : refactor the update/defrag mechanism ( #13988 )
...
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
2025-06-04 18:58:20 +03:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls ( #13921 )
...
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sqeuence SWA contexts
2025-05-31 15:57:44 +03:00
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation ( #13845 )
...
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Georgi Gerganov
4f81b33e32
llama : validate seq id batch input ( #13809 )
...
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
2025-05-27 09:40:59 +03:00
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
...
ggml-ci
2025-05-26 14:03:54 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 ( #13546 )
2025-05-14 21:22:49 +03:00
Gilad S.
017f10b5fa
fix: crash when calling llama_state_get_size
on a context without a KV cache ( #13542 )
2025-05-14 19:18:18 +03:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Georgi Gerganov
064cc596ac
context : fix state io for memory-less contexts ( #13470 )
...
ggml-ci
2025-05-12 15:12:27 +03:00
David Huang
7f323a589f
Add --no-op-offload
to improve -ot
pp perf in MoE models like llama4 400B ( #13386 )
2025-05-11 14:18:39 +02:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings ( #13108 )
...
* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
Georgi Gerganov
51fb96b1ff
context : remove logits_all flag ( #13284 )
...
* context : remove logits_all flag
ggml-ci
* llama : remove logits_all flag + reorder llama_context_params
ggml-ci
2025-05-08 14:26:50 +03:00
Georgi Gerganov
a75cb30dc9
context : fix reorder logic ( #13267 )
...
ggml-ci
2025-05-02 20:54:13 +03:00
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl ( #12799 )
...
* kv-cache : serparate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> contructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
ddh0
16a457facd
fix typo: n_ctx_pre_seq
-> n_ctx_per_seq
( #13221 )
2025-04-30 21:28:43 +01:00
pockers21
fb0471d175
context : do not clear output buffer on reserve ( #13152 )
...
Co-authored-by: pockers21 <liyang2@uniontech.com >
2025-04-28 16:45:40 +03:00
Diego Devesa
295354ea68
llama : fix K-shift with quantized K and BLAS backend ( #13113 )
2025-04-25 19:40:11 +02:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels ( #12953 )
...
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
2025-04-17 18:16:36 +03:00
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation ( #12801 )
...
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save and 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_atnn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00
Georgi Gerganov
3e1d29348b
kv-cache : simplify + fix warning for recurrent models ( #12756 )
...
ggml-ci
2025-04-04 21:48:10 +03:00
Diego Devesa
e0e912f49b
llama : add option to override model tensor buffers ( #11397 )
...
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
2025-04-02 14:52:01 +02:00
Georgi Gerganov
a10b36c91a
llama : refactor kv cache guard ( #12695 )
...
* llama : refactor kv cache guard
ggml-ci
* cont : fix comment [no ci]
* llama : fix kv_cache restore logic
ggml-ci
* context : simplify kv cache updates
ggml-ci
* cont : better name [no ci]
* llama : fix llama_decode return code when could not find KV slot
ggml-ci
* context : change log err -> warn [no ci]
* kv-cache : add comment + warning
2025-04-02 14:32:59 +03:00
Xuan-Son Nguyen
af6ae1efb2
llama : fix non-causal mask for gemma 3 ( #12615 )
2025-03-30 00:07:37 +01:00