# llama.cpp/examples/parallel
Simplified simulation of serving incoming requests in parallel
## Example
Generate 128 client requests (`-ns 128`), simulating 8 concurrent clients (`-np 8`). The system prompt is shared (`-pps`), meaning that it is computed once at the start. The client requests consist of up to 10 junk questions (`--junk 10`) followed by the actual question.
```bash
llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384
```
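To illustrate what a shared system prompt means at the API level, below is a minimal, hypothetical sketch using the `llama.h` batch API (`llama_batch_init`, `llama_decode`, `llama_batch_free`). It is not the actual `parallel.cpp` logic; the helper `batch_add` and the function `evaluate_shared_prompt` are made-up names for illustration. The idea is that each system-prompt token is tagged with every client's sequence ID, so its KV cache entries are computed once and shared rather than duplicated per client:

```cpp
// Hypothetical sketch (not the actual parallel.cpp implementation):
// evaluate a shared system prompt once for all client sequences.
#include "llama.h"

#include <vector>

// Append one token to the batch, tagging it with the given sequence IDs.
static void batch_add(llama_batch & batch, llama_token tok, llama_pos pos,
                      const std::vector<llama_seq_id> & seq_ids, bool want_logits) {
    const int32_t i = batch.n_tokens;

    batch.token   [i] = tok;
    batch.pos     [i] = pos;
    batch.n_seq_id[i] = (int32_t) seq_ids.size();
    for (size_t s = 0; s < seq_ids.size(); ++s) {
        batch.seq_id[i][s] = seq_ids[s];
    }
    batch.logits[i] = want_logits;

    batch.n_tokens++;
}

// Decode the system prompt once, shared across n_clients sequences.
// Returns the result of llama_decode (0 on success).
int evaluate_shared_prompt(llama_context * ctx,
                           const std::vector<llama_token> & sys_tokens,
                           int32_t n_clients) {
    // One batch sized for the system prompt, with room for n_clients seq IDs per token.
    llama_batch batch = llama_batch_init((int32_t) sys_tokens.size(), 0, n_clients);

    std::vector<llama_seq_id> all_seqs(n_clients);
    for (int32_t s = 0; s < n_clients; ++s) {
        all_seqs[s] = s;
    }

    // Each token carries every seq_id, so the KV cache entries for the
    // system prompt are shared by all clients instead of recomputed.
    for (size_t i = 0; i < sys_tokens.size(); ++i) {
        batch_add(batch, sys_tokens[i], (llama_pos) i, all_seqs, /*want_logits=*/false);
    }

    const int ret = llama_decode(ctx, batch);

    llama_batch_free(batch);
    return ret;
}
```

After this single decode, every client sequence can continue generating from the end of the system prompt without recomputing it; the example then interleaves the per-client prompts and generated tokens into shared batches.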
> [!NOTE]
> It's recommended to use base models with this example. Instruction-tuned models might not be able to properly follow the custom chat template specified here, so the results might not be as expected.