server : remove self-extend features (#9860)

* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
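For reference, the context limit check mentioned above decides when a slot has used up its per-slot context and must stop generating, based on slot.n_past (the number of tokens already in the slot's cache). A minimal C++ sketch of that idea, with illustrative names rather than the exact hunk from this commit:

#include <cstdio>

// Illustration only: a per-slot context limit check expressed in terms of
// slot.n_past instead of a separate self-extend counter. Not the verbatim
// server code; field names mirror the server's slot state for readability.
struct server_slot_view {
    int  n_past         = 0;     // tokens currently held in the slot's cache
    int  n_ctx          = 0;     // per-slot context size (KV cache size / number of slots)
    bool has_next_token = true;  // cleared to stop generation for this slot
};

static void check_context_limit(server_slot_view & slot, bool context_shift_enabled) {
    // without context shift, generation must stop once the slot's cache is full
    if (!context_shift_enabled && slot.n_past >= slot.n_ctx) {
        slot.has_next_token = false;
    }
}

int main() {
    server_slot_view slot;
    slot.n_ctx  = 128;
    slot.n_past = 128; // the cache is full
    check_context_limit(slot, /*context_shift_enabled=*/false);
    printf("has_next_token: %d\n", slot.has_next_token ? 1 : 0); // prints 0
    return 0;
}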
Georgi Gerganov · 2024-10-12 16:06:31 +03:00 · committed by GitHub
parent 95c76e8e92
commit 1bde94dd02
4 changed files with 57 additions and 142 deletions

@@ -13,6 +13,10 @@ Feature: llama.cpp server
     And 32 as batch size
     And 2 slots
 
+  # the prompt is 301 tokens
+  # the slot context is 256/2 = 128 tokens
+  # the prompt is truncated to keep the last 109 tokens
+  # 64 tokens are generated thanks to shifting the context when it gets full
   Scenario: Inference with context shift
     And 64 server max tokens to predict
     Then the server is starting
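The arithmetic in the comments above can be checked directly. A small C++ sketch, assuming the block-wise truncation scheme used when a prompt exceeds the slot context (keep the first n_keep tokens, drop whole blocks of half the remaining slot context, keep the tail), with n_keep = 0 for this scenario; the names and layout are illustrative, not the server source:

#include <cstdio>

// Sketch of the prompt-truncation arithmetic behind the comments above:
// 301 prompt tokens, a 256-token KV cache split across 2 slots -> 128 tokens
// per slot, truncated in 64-token blocks. Assumption: n_keep = 0.
int main() {
    const int n_prompt = 301;   // tokens in the test prompt
    const int n_kv     = 256;   // total KV cache size from the scenario
    const int n_slots  = 2;     // server slots
    const int n_keep   = 0;     // prefix tokens always kept

    const int n_ctx_slot = n_kv / n_slots;       // 128 tokens per slot
    const int n_left     = n_ctx_slot - n_keep;  // 128
    const int n_block    = n_left / 2;           // 64-token truncation blocks

    const int n_erased = (n_prompt - n_keep - n_block) / n_block * n_block;  // 192
    const int n_kept   = n_keep + (n_prompt - n_keep - n_erased);            // 109

    printf("slot context: %d tokens, prompt kept: %d tokens\n", n_ctx_slot, n_kept);
    return 0;
}

Running it prints a slot context of 128 tokens and 109 kept prompt tokens, matching the comments; the remaining 64 generated tokens then rely on shifting the context once the slot fills up.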