# llama.cpp/example/parallel
Simplified simulation of serving incoming requests in parallel
## Example
Generate 128 client requests (`-ns 128`), simulating 8 concurrent clients (`-np 8`). The system prompt is shared (`-pps`), meaning that it is computed once at the start. The client requests consist of up to 10 junk questions (`--junk 10`) followed by the actual question.
```bash
llama-parallel -m model.gguf -np 8 -ns 128 --top-k 1 -pps --junk 10 -c 16384
```
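To see how throughput scales with the number of concurrent clients, the same workload can be repeated for several values of `-np`. This is only a sketch: `model.gguf` is a placeholder path, and the remaining flags simply mirror the example above.

```bash
# Sketch: sweep the number of concurrent clients while keeping the
# total request count fixed; model.gguf is a placeholder path.
for np in 1 2 4 8; do
    llama-parallel -m model.gguf -np $np -ns 128 --top-k 1 -pps --junk 10 -c 16384
done
```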
> [!NOTE]
> It's recommended to use base models with this example. Instruction-tuned models might not follow the custom chat template specified here, so the results might not be as expected.