e298d2fbd0
kv-cache : add SWA support (#13194)
* kv-cache : prepare for SWA
ggml-ci
* kv-cache : initial iSWA implementation
* kv-cache : rework error recovery logic
* models : fix Phi-3 SWA parameters
* model : adjust Granite to rope factor changes
* server : check if context can do shifts
* iswa : for now, always enable shifts (experiment)
* kv-cache : simplify SWA logic
* kv-cache : apply defrag when we fail to find slots for the batch
* llama : update docs about llama_decode
* kv-cache : update warning logs when no space for the batch is available
* llama : add llama_kv_self_seq_pos_min()
* kv-cache : keep track of partial SWA computes and print warnings
* server : disallow use cases involving partial SWA context
* llama : add param to control SWA cache size
* minor : clean-up
2025-05-20 08:05:46 +03:00

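The SWA entries above concern sliding-window attention, where each token attends only to the most recent `w` positions, so the kv-cache for SWA layers can stay bounded. A minimal sketch of the causal sliding-window masking rule (an illustrative assumption, not llama.cpp's actual implementation):

```python
def swa_allowed(q_pos: int, k_pos: int, window: int) -> bool:
    """Causal sliding-window attention rule (illustrative sketch):
    the query at q_pos may attend to the key at k_pos only if the
    key is not in the future and lies within the last `window`
    positions. Cache entries older than the window can be evicted."""
    return k_pos <= q_pos and q_pos - k_pos < window


# Example: with window=4, the token at position 10 attends to
# positions 7..10 and nothing earlier.
```

With this rule, only the last `window` keys/values per sequence ever need to be resident, which is why a separate, smaller SWA cache (and a parameter controlling its size) makes sense.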
6c8b91500e
llama-bench : fix -ot with dl backends (#13563)
2025-05-15 15:46:55 +02:00

b2838049cc
bench : handle decode errors (#13548)
ggml-ci
2025-05-15 05:57:02 +03:00

cf0a43bb64
llama-bench : add defrag-thold, check for invalid ranges (#13487)
2025-05-13 00:31:37 +02:00

22cdab343b
llama-bench : accept ranges for integer parameters (#13410)
2025-05-12 13:08:22 +02:00

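The range-parameter entry above lets a benchmark sweep a set of values from one argument instead of listing each value. A hypothetical parser sketch is below; the `start-end+step` syntax and the function name are assumptions for illustration, not necessarily llama-bench's exact grammar:

```python
def parse_int_range(spec: str) -> list[int]:
    """Parse an integer parameter spec into a list of values.
    Accepts a single integer ("32") or a range with an additive
    step ("128-512+128" -> [128, 256, 384, 512]). Hypothetical
    syntax for illustration; rejects empty or backwards ranges,
    matching the invalid-range checks the log mentions."""
    if "-" not in spec:
        return [int(spec)]
    lo_s, rest = spec.split("-", 1)
    if "+" in rest:
        hi_s, step_s = rest.split("+", 1)
    else:
        hi_s, step_s = rest, "1"
    lo, hi, step = int(lo_s), int(hi_s), int(step_s)
    if step <= 0 or hi < lo:
        raise ValueError(f"invalid range: {spec!r}")
    return list(range(lo, hi + 1, step))
```

Validating the range at parse time (rather than during the run) is what the companion defrag-thold commit's "check for invalid ranges" refers to: a bad sweep fails fast instead of producing an empty or nonsensical benchmark matrix.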
7f323a589f
Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (#13386)
2025-05-11 14:18:39 +02:00

1d36b3670b
llama : move end-user examples to tools directory (#13249)
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00