Georgi Gerganov
ba42794c9e
graph : fix equal_seq() check ( #14986 )
...
ggml-ci
2025-08-01 06:38:12 +03:00
Daniel Bevenius
ca0ef2dddb
llama : clarify comment about pp and tg graphs [no ci] ( #14895 )
...
* llama : clarify comment about pp and tg graphs [no ci]
This commit clarifies the comment in `llama-context.cpp` regarding the
prompt processing (pp) and token generation (tg) graphs.
The motivation for this is that I've struggled to remember these and had
to look them up more than once, so I thought it would be helpful to add
a comment that makes it clear what these stand for.
* squash! llama : clarify comment about pp and tg graphs [no ci]
Change "pp" to "prompt processing".
2025-07-27 12:10:51 +02:00
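For reference, a minimal sketch of the two graph shapes the comment distinguishes; the batch sizes and helper are illustrative, not the actual `llama-context.cpp` code:

```cpp
#include "llama.h"

// Illustrative only: the two worst-case batch shapes that "pp" and "tg"
// refer to when the context reserves its compute graphs.
void reserve_graph_shapes_sketch() {
    const int32_t n_ubatch = 512; // assumed example value

    // pp: prompt processing, a full ubatch of tokens in a single pass
    llama_batch pp_batch = llama_batch_init(/*n_tokens =*/ n_ubatch, /*embd =*/ 0, /*n_seq_max =*/ 1);

    // tg: token generation, one token per decode step
    llama_batch tg_batch = llama_batch_init(/*n_tokens =*/ 1, /*embd =*/ 0, /*n_seq_max =*/ 1);

    llama_batch_free(pp_batch);
    llama_batch_free(tg_batch);
}
```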
Georgi Gerganov
c1dbea752a
context : restore preemptive sched reset when LLAMA_SET_ROWS=0 ( #14870 )
...
ggml-ci
2025-07-25 14:28:06 +03:00
Georgi Gerganov
e4868d16d2
context : perform output reorder lazily upon access after sync ( #14853 )
...
* context : perform output reorder lazily upon access after sync
ggml-ci
* cont : add TODO
2025-07-24 16:31:48 +03:00
Georgi Gerganov
d498af3d5a
graph : avoid huge warm-up graphs for MoE models ( #14753 )
...
* graph : avoid huge warm-up graphs for MoE models
ggml-ci
* cont : bump max nodes to 8x model tensors
2025-07-18 14:31:15 +03:00
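A sketch of the node-budget idea; only the 8x factor comes from the commit message, the 8192 floor and the function name are assumptions:

```cpp
#include <algorithm>
#include <cstddef>

// Size the warm-up graph relative to the model's tensor count instead of
// a fixed huge node budget, so MoE models don't reserve enormous graphs.
// The 8192 floor is an assumed example; the 8x factor is from the commit.
size_t max_graph_nodes(size_t n_model_tensors) {
    return std::max<size_t>(8192, 8*n_model_tensors);
}
```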
Georgi Gerganov
8f974bc1e9
graph : refactor context to not pass gf explicitly ( #14629 )
...
ggml-ci
2025-07-18 08:29:28 +03:00
Georgi Gerganov
01612b7409
llama : reuse compute graphs ( #14482 )
...
* llama : reuse compute graphs
ggml-ci
* llama-bench : add graph reuse parameter
ggml-ci
* cont : remove the parameter and the sched resets
ggml-ci
* graph : rename update() to can_reuse()
ggml-ci
* params : remove is_same()
ggml-ci
* graph : set res->params in llm_graph_context constructor
ggml-ci
* graph : avoid set_max_nodes in llm_graph_result
ggml-ci
* kv-cache : reuse llama_context's graph result instance
ggml-ci
* context : reset the previous graph result upon memory updates
ggml-ci
* batch : llama_ubatch now carries its data instead of pointing to balloc
ggml-ci
* merge : fix build
ggml-ci
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
ggml-ci
2025-07-17 19:08:33 +03:00
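A hypothetical sketch of the reuse check described above; the struct and field names are assumed, only the update() -> can_reuse() rename and the flash-attention caveat come from the commit:

```cpp
#include <cstdint>

// A cached graph is reusable only if the new ubatch matches the
// parameters the graph was originally built for.
struct graph_params_sketch {
    uint32_t n_tokens;
    uint32_t n_seqs;
    bool     flash_attn; // reuse checks differ when flash-attention is disabled

    // mirrors the rename update() -> can_reuse() described above
    bool can_reuse(const graph_params_sketch & cur) const {
        return n_tokens   == cur.n_tokens &&
               n_seqs     == cur.n_seqs   &&
               flash_attn == cur.flash_attn;
    }
};
```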
Georgi Gerganov
225e7a1438
llama : add high-throughput mode ( #14363 )
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support ( #14628 )
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
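A hedged usage sketch of the renamed flag; the exact semantics of `kv_unified` are inferred from the commit description (unified KV buffer vs one stream per sequence):

```cpp
#include "llama.h"

// Opting into per-sequence KV streams (the "high-throughput" path).
llama_context_params make_streamed_params() {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_seq_max  = 8;     // several independent sequences
    cparams.kv_unified = false; // assumed: separate KV stream per sequence
    return cparams;
}
```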
Aman Gupta
9c9e4fc635
llama-context: add ability to get logits ( #14672 )
2025-07-14 21:01:41 +08:00
Georgi Gerganov
7b50d589a8
kv-cells : fix tracking of seq_pos ( #14339 )
...
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
2025-06-23 12:27:35 +03:00
Georgi Gerganov
692e3cdd0a
memory : rename interface to llama_memory_context_i ( #14296 )
...
* memory : rename interface to llama_memory_context_i
ggml-ci
* cont : fix comments
* cont : use "mctx" for referencing a memory context
ggml-ci
2025-06-21 08:03:46 +03:00
Georgi Gerganov
4c9fdfbe15
ubatch : new splitting logic ( #14217 )
...
ggml-ci
2025-06-20 10:14:14 +03:00
Georgi Gerganov
d3e64b9f49
llama : rework embeddings logic ( #14208 )
...
* llama : rework embeddings logic
ggml-ci
* cont : fix rerank
ggml-ci
* cont : engrish [no ci]
* cont : fix rerank
ggml-ci
* server : support both embeddings and completions with single model
ggml-ci
* cont : avoid embeddings_org
ggml-ci
2025-06-16 14:14:00 +03:00
Georgi Gerganov
c311ac664d
cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ ( #14188 )
...
ggml-ci
2025-06-15 10:08:58 +03:00
Georgi Gerganov
b9912ac570
batch : auto-gen positions + verify multi-sequence input ( #14177 )
...
* batch : verify multi-sequence input batches
ggml-ci
* cont : auto-gen positions + verify multi-seq input
ggml-ci
* cont : first print debug info, then perform validation
ggml-ci
* cont : fix position auto-gen + add comments
ggml-ci
2025-06-15 09:18:37 +03:00
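An illustrative sketch of the position auto-generation rule; this is not the actual `llama_batch_allocr` code, and the helper name is hypothetical:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// When the caller passes pos == NULL, a position is generated for each
// token by continuing from the last position seen for its sequence.
std::vector<int32_t> autogen_positions(
        const std::vector<int32_t> & seq_id_per_token,
        std::unordered_map<int32_t, int32_t> & last_pos) {
    std::vector<int32_t> pos;
    pos.reserve(seq_id_per_token.size());
    for (int32_t s : seq_id_per_token) {
        auto it = last_pos.find(s);
        const int32_t p = (it == last_pos.end()) ? 0 : it->second + 1;
        pos.push_back(p);
        last_pos[s] = p;
    }
    return pos;
}
```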
Georgi Gerganov
60c666347b
batch : rework llama_batch_allocr ( #14153 )
...
* batch : rework llama_batch_allocr
ggml-ci
* cont : move validation inside class
ggml-ci
* cont : move output counting to class
ggml-ci
* cont : minor
ggml-ci
* batch : add TODOs
ggml-ci
2025-06-13 13:47:55 +03:00
Georgi Gerganov
f6e1a7aa87
context : simplify output counting logic during decode ( #14142 )
...
* batch : remove logits_all flag
ggml-ci
* context : simplify output counting logic during decode
ggml-ci
* cont : fix comments
2025-06-12 11:50:01 +03:00
Georgi Gerganov
c3ee46fab4
batch : remove logits_all flag ( #14141 )
...
ggml-ci
2025-06-12 11:49:26 +03:00
Georgi Gerganov
9596506965
kv-cache : fix split_equal handling in unified implementation ( #14130 )
...
ggml-ci
2025-06-12 10:02:15 +03:00
compilade
a20b2b05bc
context : round n_tokens to next multiple of n_seqs when reserving ( #14140 )
...
This fixes RWKV inference, which otherwise failed when the worst-case
`ubatch.n_seq_tokens` rounded down to 0.
2025-06-12 02:56:04 -04:00
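A sketch of the rounding fix; the helper name is hypothetical:

```cpp
#include <cstdint>

// Round n_tokens up to the next multiple of n_seqs so that the
// per-sequence token count n_seq_tokens = n_tokens / n_seqs can never
// truncate to 0 when reserving the worst-case ubatch.
uint32_t round_up_to_n_seqs(uint32_t n_tokens, uint32_t n_seqs) {
    return ((n_tokens + n_seqs - 1) / n_seqs) * n_seqs;
}
// e.g. n_tokens = 2, n_seqs = 3 -> 3, giving n_seq_tokens = 1 instead of 0
```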
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API ( #14030 )
...
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
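A hedged migration sketch for the deprecated calls, assuming the `llama_get_memory()`/`llama_memory_clear()` API referenced in the commit:

```cpp
#include "llama.h"

// The llama_kv_self_* calls are replaced by operations on the memory
// object. The boolean corresponds to the optional data clear added above.
void clear_memory_sketch(llama_context * ctx) {
    llama_memory_t mem = llama_get_memory(ctx);
    llama_memory_clear(mem, /*data =*/ true); // also clear the data buffers
}
```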
Georgi Gerganov
487a5e0401
context : fix SWA-related warning for multiple sequences ( #14045 )
2025-06-06 13:29:18 +03:00
Sigbjørn Skjæret
d17a809ef0
llama : support multiple classifier outputs and labels ( #13940 )
2025-06-06 09:03:25 +02:00
Georgi Gerganov
7f37b6cf1e
memory : migrate from llama_kv_cache to more generic llama_memory ( #14006 )
...
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API
ggml-ci
* context : fix casts
ggml-ci
2025-06-05 15:29:22 +03:00
Georgi Gerganov
9e31bec4fd
context : fix pos_min initialization upon error decode ( #14008 )
...
ggml-ci
2025-06-05 09:06:29 +03:00
Georgi Gerganov
3e63a58ef7
kv-cache : refactor the update/defrag mechanism ( #13988 )
...
* kv-cache : refactor update mechanism
ggml-ci
* memory : improve status handling
* defrag : reset head + add comments
ggml-ci
* cont : minor fixes
ggml-ci
2025-06-04 18:58:20 +03:00
Georgi Gerganov
803f8baf4f
llama : deprecate explicit kv_self defrag/update calls ( #13921 )
...
ggml-ci
2025-05-31 15:58:33 +03:00
Georgi Gerganov
3600cc2886
llama : use n_swa + n_ubatch cells for SWA cache ( #13833 )
...
* llama : use n_swa + n_ubatch cells for SWA cache
ggml-ci
* llama : add warning about multi-sequence SWA contexts
2025-05-31 15:57:44 +03:00
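A sketch of the sizing reasoning; the helper is hypothetical:

```cpp
#include <cstdint>

// A sliding-window layer only attends to the last n_swa positions, plus
// it needs room for the ubatch currently being processed, so the SWA
// cache does not have to span the full context length.
uint32_t swa_cache_cells(uint32_t n_swa, uint32_t n_ubatch) {
    return n_swa + n_ubatch;
}
```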
Georgi Gerganov
3f55f781f1
llama : auto-batch preparation ( #13845 )
...
* llama : auto-batch
ggml-ci
* context : simplify if branching
2025-05-31 12:55:57 +03:00
Georgi Gerganov
12d0188c0d
kv-cache : refactor + add llama_memory_state_i ( #13746 )
...
* kv-cache : simplify the "struct llama_kv_cache" interface
ggml-ci
* kv-cache : revert the (n_swa + n_ubatch) change (for next PR)
ggml-ci
* kv-cache : some comments
ggml-ci
* context : fix graph reserve for multiple sequences
ggml-ci
* kv-cache : fix typo [no ci]
* kv-cache : fix find_slot() logic for free slots
ggml-ci
* llama : add TODO for deprecating the defrag API in the future
* kv-cache : improve find_slot() using min/max seq pos info
ggml-ci
* llama : handle aborts and compute errors
ggml-ci
* memory : extract state into llama_memory_state
ggml-ci
* kv-cache : add comments
ggml-ci
* server : update batching logic to reset n_batch on successful decode
* server : upon full re-processing, remove the sequence from the cache
* kv-cache : add TODO for doing split_equal when split_simple fails
ggml-ci
2025-05-31 10:24:04 +03:00
Georgi Gerganov
4f81b33e32
llama : validate seq id batch input ( #13809 )
...
* llama : validate seq id batch input
ggml-ci
* cont : fix the fix
ggml-ci
2025-05-27 09:40:59 +03:00
Georgi Gerganov
79c137f776
examples : allow extracting embeddings from decoder contexts ( #13797 )
...
ggml-ci
2025-05-26 14:03:54 +03:00
Georgi Gerganov
de2ef53a4b
kv-cache : rework kv_cell ( #13706 )
...
* kv-cache : rework kv_cell
ggml-ci
* kv-cells : use "shift" instead of "delta" consistently
ggml-ci
* llama : add llama_max_parallel_sequences()
ggml-ci
* kv-cells : update comments [no ci]
* context : fail upon construction if sequences exceed max value
ggml-ci
* kv-cells : get_pos() -> pos_get() + comments
ggml-ci
* kv-cells : fix tracking of "used" cells
ggml-ci
2025-05-25 16:34:36 +03:00
Georgi Gerganov
797f2ac062
kv-cache : simplify the interface ( #13660 )
...
* kv-cache : simplify the interface
ggml-ci
* context : revert llama_batch_allocr position change
ggml-ci
2025-05-21 15:11:13 +03:00
Georgi Gerganov
a4090d1174
llama : remove llama_kv_cache_view API + remove deprecated ( #13653 )
...
ggml-ci
2025-05-20 16:13:16 +03:00
Georgi Gerganov
e298d2fbd0
kv-cache : add SWA support ( #13194 )
...
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
ggml-ci
* kv-cache : rework error recovery logic
ggml-ci
* models : fix Phi-3 SWA parameters
ggml-ci
* model : adjust Granite to rope factor changes
ggml-ci
* server : check if context can do shifts
ggml-ci
* iswa : for now, always enable shifts (experiment)
ggml-ci
* kv-cache : simplify SWA logic
ggml-ci
* kv-cache : apply defrag when we fail to find slots for the batch
ggml-ci
* llama : update docs about llama_decode
ggml-ci
* kv-cache : update warning logs when no space for the batch is available
ggml-ci
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
ggml-ci
* llama : add param to control SWA cache size
ggml-ci
* minor : clean-up
ggml-ci
2025-05-20 08:05:46 +03:00
Sigbjørn Skjæret
f5170c1d7a
editorconfig : fix trailing whitespace from #13542 ( #13546 )
2025-05-14 21:22:49 +03:00
Gilad S.
017f10b5fa
fix: crash when calling llama_state_get_size on a context without a KV cache ( #13542 )
2025-05-14 19:18:18 +03:00
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support ( #10544 )
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
Georgi Gerganov
064cc596ac
context : fix state io for memory-less contexts ( #13470 )
...
ggml-ci
2025-05-12 15:12:27 +03:00
David Huang
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B ( #13386 )
2025-05-11 14:18:39 +02:00
Georgi Gerganov
6562e5a4d6
context : allow cache-less context for embeddings ( #13108 )
...
* context : allow cache-less context for embeddings
ggml-ci
* context : enable reranking with encode()
ggml-ci
* context : encode() clears embd_seq
ggml-ci
* examples : use llama_encode() when appropriate
ggml-ci
* models : nomic bert moe does not require KV cache
* llama : update comments for llama_decode/llama_encode
ggml-ci
* context : update warning log [no ci]
2025-05-08 14:28:33 +03:00
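A hedged usage sketch of the cache-less embedding path; model setup and pooling configuration are assumed, and error handling is omitted:

```cpp
#include "llama.h"

// Extract embeddings through llama_encode() on an embeddings-enabled
// context, per the commit above.
const float * embed_sketch(llama_model * model, llama_batch batch) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings = true; // cache-less embedding context

    llama_context * ctx = llama_init_from_model(model, cparams);

    llama_encode(ctx, batch); // encode() is appropriate for embeddings

    // sequence embedding for seq_id 0 (requires pooling != NONE)
    return llama_get_embeddings_seq(ctx, 0);
}
```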
Georgi Gerganov
51fb96b1ff
context : remove logits_all flag ( #13284 )
...
* context : remove logits_all flag
ggml-ci
* llama : remove logits_all flag + reorder llama_context_params
ggml-ci
2025-05-08 14:26:50 +03:00
Georgi Gerganov
a75cb30dc9
context : fix reorder logic ( #13267 )
...
ggml-ci
2025-05-02 20:54:13 +03:00
Georgi Gerganov
c642bc014c
kv-cache : separate recurrent vs non-recurrent impl ( #12799 )
...
* kv-cache : separate recurrent vs non-recurrent impl (wip)
ggml-ci
* kv-cache : init -> constructor + add llama_memory_params
ggml-ci
* kv-cache : fix callback reference
ggml-ci
* context : llama_kv_cache -> llama_memory_i
ggml-ci
* context : move memory creation logic to model
ggml-ci
* llama : remove reference of memory during encode
ggml-ci
* kv-cache : hide padding details in the implementation
ggml-ci
* kv-cache : add ubatch_next()
ggml-ci
* context : simplify sbatch logic
ggml-ci
* kv-cache : hide defrag logic in the implementation
ggml-ci
* context : hide kv cache details in implementation
ggml-ci
* build : fix
ggml-ci
* cont : another fix
ggml-ci
* kv-cache : simplify interface (wip)
ggml-ci
* kv-cache : use separate KV cell structs for unified/recurrent
ggml-ci
* kv-cache : clean-up
ggml-ci
* model : better llama_model::create_model() signature
ggml-ci
* kv-cache : fix recurrent seq_rm()
ggml-ci
* kv-cache : replace `struct callbacks` with `llama_model &`
ggml-ci
* kv-cache : replace `struct graph_params` with `llama_context &`
ggml-ci
* kv-cache : fix offload check
ggml-ci
* context : avoid passing unique_ptr
ggml-ci
* kv-cache : avoid using the backends from the llama_context
ref #13113
ggml-ci
* kv-cache : more consistent debug logs [no ci]
* kv-cache : do not pass the full llama_context for kv graphs
ggml-ci
* kv-cache : remove comment
* kv-cache : ggml_rope_ext_inplace -> ggml_rope_ext
ggml-ci
* kv-cache : fix recurrent multi-user case
ggml-ci
* memory : remove comments [no ci]
2025-05-02 17:48:36 +03:00
ddh0
16a457facd
fix typo: n_ctx_pre_seq -> n_ctx_per_seq ( #13221 )
2025-04-30 21:28:43 +01:00
pockers21
fb0471d175
context : do not clear output buffer on reserve ( #13152 )
...
Co-authored-by: pockers21 <liyang2@uniontech.com>
2025-04-28 16:45:40 +03:00
Diego Devesa
295354ea68
llama : fix K-shift with quantized K and BLAS backend ( #13113 )
2025-04-25 19:40:11 +02:00
Georgi Gerganov
2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels ( #12953 )
...
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
2025-04-17 18:16:36 +03:00
Juk Armstrong
daa422881a
llama : DeepSeek V2/V3 MLA implementation ( #12801 )
...
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as it yields no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
2025-04-15 09:49:57 +03:00