server : remove self-extend features (#9860)
* server : remove self-extend
  ggml-ci
* server : fix context limit check to use slot.n_past
  ggml-ci
@@ -13,6 +13,10 @@ Feature: llama.cpp server
     And 32 as batch size
     And 2 slots
 
+  # the prompt is 301 tokens
+  # the slot context is 256/2 = 128 tokens
+  # the prompt is truncated to keep the last 109 tokens
+  # 64 tokens are generated thanks to shifting the context when it gets full
   Scenario: Inference with context shift
     And 64 server max tokens to predict
     Then the server is starting
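The four comments added in the hunk above come down to a bit of integer arithmetic. The following is a minimal, self-contained C++ sketch of that arithmetic, assuming n_keep = 0 and a halve-and-drop truncation scheme; the variable names (n_ctx_slot, n_block_size, etc.) are illustrative and not copied from server.cpp.

```cpp
// Sketch of the context-shift test numbers described in the comments above.
// Assumption: n_keep = 0 and truncation drops whole half-context blocks
// from the prompt until the remainder fits in the slot context.
#include <cstdio>

int main() {
    const int n_ctx      = 256;             // total KV cache size
    const int n_slots    = 2;
    const int n_ctx_slot = n_ctx / n_slots; // 256/2 = 128 tokens per slot
    const int n_prompt   = 301;             // prompt length in tokens
    const int n_keep     = 0;               // tokens always kept at the front (assumed default)

    // Truncation: drop half-context-sized blocks after n_keep until the prompt fits.
    const int n_block_size  = (n_ctx_slot - n_keep) / 2;                         // 64
    const int erased_blocks = (n_prompt - n_keep - n_block_size) / n_block_size; // (301-64)/64 = 3
    const int n_truncated   = n_prompt - erased_blocks * n_block_size;           // 301 - 192 = 109

    printf("slot context:            %d tokens\n", n_ctx_slot); // 128
    printf("prompt after truncation: %d tokens\n", n_truncated); // 109

    // Generation: 109 cached tokens plus 64 new ones would exceed the 128-token
    // slot context, so once the slot fills up the oldest tokens (after n_keep)
    // are shifted out of the KV cache, letting all 64 tokens be produced.
    const int n_predict = 64;
    printf("tokens generated with context shift: %d\n", n_predict);
    return 0;
}
```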