server : remove self-extend features (#9860)
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
This commit is contained in:
parent 95c76e8e92
commit 1bde94dd02

4 changed files with 57 additions and 142 deletions
@@ -13,6 +13,10 @@ Feature: llama.cpp server
    And 32 as batch size
    And 2 slots

    # the prompt is 301 tokens
    # the slot context is 256/2 = 128 tokens
    # the prompt is truncated to keep the last 109 tokens
    # 64 tokens are generated thanks to shifting the context when it gets full
  Scenario: Inference with context shift
    And 64 server max tokens to predict
    Then the server is starting
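The numbers in the added comments follow from the server's block-based prompt truncation: the slot keeps its first n_keep tokens and then drops whole half-context blocks from the front of the remainder until the prompt fits. Below is a minimal sketch of that arithmetic, assuming n_keep = 0 and the 256-token KV cache split across 2 slots from the scenario; the variable names are illustrative, not taken from server.cpp.

#include <cstdio>

int main() {
    const int n_ctx_slot      = 256 / 2; // 256 KV cache shared by 2 slots -> 128 tokens per slot
    const int n_prompt_tokens = 301;     // prompt size from the test scenario
    const int n_keep          = 0;       // assume no tokens pinned at the start of the prompt

    const int n_left       = n_ctx_slot - n_keep;  // 128 tokens available for the sliding part
    const int n_block_size = n_left / 2;           // 64-token blocks are erased as whole units
    const int erased_blocks =
        (n_prompt_tokens - n_keep - n_block_size) / n_block_size; // (301 - 64) / 64 = 3

    // 301 - 3 * 64 = 109 tokens survive the truncation
    const int n_truncated = n_keep + (n_prompt_tokens - n_keep - erased_blocks * n_block_size);
    printf("truncated prompt keeps %d tokens\n", n_truncated);

    return 0;
}

With these values the 301-token prompt loses 3 blocks of 64 tokens, leaving the 109 tokens the comment mentions; the 64 predicted tokens then rely on context shifting once the slot fills up.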