af6ae1efb2
llama : fix non-causal mask for gemma 3 ( #12615 )
2025-03-30 00:07:37 +01:00
b4ae50810e
metal : improve FA + improve MoE ( #12612 )
...
* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
2025-03-28 20:21:59 +02:00
2d77d88e70
context : fix worst-case reserve outputs ( #12545 )
...
ggml-ci
2025-03-25 09:19:23 +02:00
568013d0cd
context : clear sets containing encoder output sequence ids before storing new values ( #12470 )
...
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com >
2025-03-19 21:01:57 +01:00
75422e8bc4
graph : normalize Q, K, V shapes + sync cross attention ( #12449 )
...
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
2025-03-18 21:35:19 +02:00
8551c44d84
context : always use non-causal attention for encoder graphs ( #12447 )
...
* context : always use non-causal attention for encoder graphs
ggml-ci
* context : move the change to llama_context::encode()
ggml-ci
2025-03-18 13:05:49 +02:00
dc079cfdff
context : fix init of n_outputs ( #12397 )
...
ggml-ci
2025-03-16 19:29:36 +02:00
8fcb563613
Load all MoE experts during warmup ( #11571 )
...
* llama : introduce llama_set_warmup() API call that controls warmup mode; use all MoE experts during warmup
* common : use new API to enable warmup mode during model warmup
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com >
2025-03-14 13:47:05 +01:00
081bee8c64
hparams : add SWA rope parameters ( #12374 )
...
ggml-ci
2025-03-14 09:03:24 +02:00
84d5475541
llama : fix Gemma3 SWA KV cache shift ( #12373 )
...
* llama : fix Gemma3 SWA KV cache shift
ggml-ci
* hparams : add comment [no ci]
2025-03-13 19:08:07 +02:00
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context ( #12181 )
...
* llama : refactor llama_context, llama_kv_cache, llm_build_context
ggml-ci
* graph : don't mutate the KV cache during defrag
ggml-ci
* context : reduce virtuals + remove test function
ggml-ci
* context : move interface implementation to source file + factory
ggml-ci
* graph : move KV cache build functions to llama_context impl
ggml-ci
* graph : remove model reference from build_pooling
ggml-ci
* graph : remove llama_model reference
ggml-ci
* kv_cache : provide rope factors
ggml-ci
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
ggml-ci
* context : remove llama_context_i abstraction
ggml-ci
* context : clean-up
ggml-ci
* graph : clean-up
ggml-ci
* llama : remove redundant keywords (struct, enum)
ggml-ci
* model : adapt gemma3
ggml-ci
* graph : restore same attention ops as on master
ggml-ci
* llama : remove TODO + fix indent
ggml-ci
2025-03-13 12:35:44 +02:00
afa8a9ec9b
llama : add llama_vocab
, functions -> methods, naming ( #11110 )
...
* llama : functions -> methods (#11110 )
* llama : add struct llama_vocab to the API (#11156 )
ggml-ci
* hparams : move vocab params to llama_vocab (#11159 )
ggml-ci
* vocab : more pimpl (#11165 )
ggml-ci
* vocab : minor tokenization optimizations (#11160 )
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com >
* lora : update API names (#11167 )
ggml-ci
* llama : update API names to use correct prefix (#11174 )
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174 )
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174 )
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com >
2025-01-12 11:32:42 +02:00
f66f582927
llama : refactor src/llama.cpp
( #10902 )
...
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap
ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF
ggml-ci
* llama : arch (cont)
ggml-ci
* llama : chat
ggml-ci
* llama : model
ggml-ci
* llama : hparams
ggml-ci
* llama : adapter
ggml-ci
* examples : fix
ggml-ci
* rebase
ggml-ci
* minor
* llama : kv cache
ggml-ci
* llama : impl
ggml-ci
* llama : batch
ggml-ci
* cont
ggml-ci
* llama : context
ggml-ci
* minor
* llama : context (cont)
ggml-ci
* llama : model loader
ggml-ci
* common : update lora
ggml-ci
* llama : quant
ggml-ci
* llama : quant (cont)
ggml-ci
* minor [no ci]
2025-01-03 10:18:53 +02:00