server : remove self-extend features (#9860)

* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
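For reference, the context limit check mentioned above decides when a slot has used up its per-slot context and must stop generating, based on slot.n_past (the number of tokens already in the slot's cache). A minimal C++ sketch of that idea, with illustrative names rather than the exact hunk from this commit:

#include <cstdio>

// Illustration only: a per-slot context limit check expressed in terms of
// slot.n_past instead of a separate self-extend counter. Not the verbatim
// server code; field names mirror the server's slot state for readability.
struct server_slot_view {
    int  n_past         = 0;     // tokens currently held in the slot's cache
    int  n_ctx          = 0;     // per-slot context size (KV cache size / number of slots)
    bool has_next_token = true;  // cleared to stop generation for this slot
};

static void check_context_limit(server_slot_view & slot, bool context_shift_enabled) {
    // without context shift, generation must stop once the slot's cache is full
    if (!context_shift_enabled && slot.n_past >= slot.n_ctx) {
        slot.has_next_token = false;
    }
}

int main() {
    server_slot_view slot;
    slot.n_ctx  = 128;
    slot.n_past = 128; // the cache is full
    check_context_limit(slot, /*context_shift_enabled=*/false);
    printf("has_next_token: %d\n", slot.has_next_token ? 1 : 0); // prints 0
    return 0;
}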
Georgi Gerganov · 2024-10-12 16:06:31 +03:00 · committed by GitHub
parent 95c76e8e92
commit 1bde94dd02
4 changed files with 57 additions and 142 deletions

@@ -13,6 +13,10 @@ Feature: llama.cpp server
     And 32 as batch size
     And 2 slots
 
+  # the prompt is 301 tokens
+  # the slot context is 256/2 = 128 tokens
+  # the prompt is truncated to keep the last 109 tokens
+  # 64 tokens are generated thanks to shifting the context when it gets full
   Scenario: Inference with context shift
     And 64 server max tokens to predict
     Then the server is starting
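The arithmetic in the comments above can be checked directly. A small C++ sketch, assuming the block-wise truncation scheme used when a prompt exceeds the slot context (keep the first n_keep tokens, drop whole blocks of half the remaining slot context, keep the tail), with n_keep = 0 for this scenario; the names and layout are illustrative, not the server source:

#include <cstdio>

// Sketch of the prompt-truncation arithmetic behind the comments above:
// 301 prompt tokens, a 256-token KV cache split across 2 slots -> 128 tokens
// per slot, truncated in 64-token blocks. Assumption: n_keep = 0.
int main() {
    const int n_prompt = 301;   // tokens in the test prompt
    const int n_kv     = 256;   // total KV cache size from the scenario
    const int n_slots  = 2;     // server slots
    const int n_keep   = 0;     // prefix tokens always kept

    const int n_ctx_slot = n_kv / n_slots;       // 128 tokens per slot
    const int n_left     = n_ctx_slot - n_keep;  // 128
    const int n_block    = n_left / 2;           // 64-token truncation blocks

    const int n_erased = (n_prompt - n_keep - n_block) / n_block * n_block;  // 192
    const int n_kept   = n_keep + (n_prompt - n_keep - n_erased);            // 109

    printf("slot context: %d tokens, prompt kept: %d tokens\n", n_ctx_slot, n_kept);
    return 0;
}

Running it prints a slot context of 128 tokens and 109 kept prompt tokens, matching the comments; the remaining 64 generated tokens then rely on shifting the context once the slot fills up.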